Public Dataset Collection for ML-Based Quality-of-Transmission Estimation

Summary

This project establishes a publicly available, large-scale dataset collection for machine-learning-based Quality-of-Transmission (QoT) estimation in elastic optical networks. Generated from physically grounded, event-driven network simulations, the datasets enable reproducible benchmarking of QoT classification and regression models across heterogeneous network scenarios. Beyond data release, the project introduces feature-wise and multi-categorical dataset visualizations that make dataset structure, bias, and coverage explicit. A central result is that dataset design choices such as topology, traffic profile, and provisioning assumptions can dominate both apparent model performance and cross-dataset generalization, making data-centric evaluation a first-order concern in optical-network ML.

Problem

Research on ML-based Quality-of-Transmission estimation has been fundamentally limited by the absence of shared, transparent benchmark datasets. Most prior work relies on proprietary or sparsely documented simulation data, preventing reproducibility, obscuring dataset bias, and making reported performance figures difficult to compare across studies.

Context

Within the BMBF-funded project AI-NET-PROTECT, this work addressed the dataset bottleneck by creating a public QoT dataset collection designed explicitly for benchmarking, comparison, and error analysis of ML-based QoT estimators under controlled yet realistic operating assumptions.

System

The project delivers a QoT dataset collection comprising four large-scale datasets, each generated from dynamic optical-network simulations using Fraunhofer HHI’s planning tool PLATON.

Each dataset contains more than 1.2 million labeled samples and supports both classification and regression tasks, with targets including BER, OSNR, and SNR. Two complementary data representations are provided:

Lightpath-level datasets for classical and deep ML models
Network-state datasets enabling network-wide QoT estimation

The datasets systematically vary network topology, traffic profile, and transceiver operation mode, while maintaining a consistent schema and comprehensive datasheets.
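
To illustrate how a single labeled sample can serve both task types, the following minimal sketch derives a binary classification label from a continuous BER target. The record layout, the field names, and the threshold value are assumptions made for illustration only; they are not taken from the dataset schema or its datasheets.

```python
from dataclasses import dataclass

# Hypothetical threshold: 3.8e-3 is a commonly quoted hard-decision FEC limit.
# It is used here only for illustration; the datasets' labeling rule may differ.
BER_THRESHOLD = 3.8e-3


@dataclass
class LightpathSample:
    """Illustrative record; field names are assumptions, not the real schema."""
    path_length_km: float
    num_spans: int
    ber: float        # continuous target for regression
    osnr_db: float    # continuous target for regression


def classification_label(sample, threshold=BER_THRESHOLD):
    """Return 1 if the lightpath's BER is at or below the threshold, else 0."""
    return 1 if sample.ber <= threshold else 0


# Two made-up samples: one short, low-BER path and one long, high-BER path.
samples = [
    LightpathSample(path_length_km=800.0, num_spans=10, ber=1.0e-4, osnr_db=22.5),
    LightpathSample(path_length_km=2400.0, num_spans=30, ber=9.0e-3, osnr_db=15.1),
]
labels = [classification_label(s) for s in samples]
```

Keeping the continuous targets alongside the derived label means the same record can feed a regression model or a classifier without regenerating data.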

Constraints

Physically grounded QoT labels via nonlinear channel modeling, using a Gaussian Noise (GN) model approximation
Controlled variation of simulation assumptions to isolate dataset effects
Scalability to multi-million-sample datasets without manual labeling
Support for both lightpath-based and network-wide learning paradigms
Dataset transparency beyond aggregate performance metrics
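
As background for the first constraint, the sketch below shows the widely used closed-form view of the GN model, in which nonlinear interference power grows with the cube of the channel launch power, so that SNR = P / (P_ASE + eta * P^3). This is a generic textbook form and not a description of how PLATON computes its labels; the function names and the numeric values for ASE power and the nonlinear coefficient are illustrative assumptions.

```python
import math


def gn_snr_linear(p_ch_w, p_ase_w, eta_per_w2):
    """Linear SNR under the GN approximation: P / (P_ASE + eta * P^3)."""
    p_nli_w = eta_per_w2 * p_ch_w ** 3  # nonlinear interference power
    return p_ch_w / (p_ase_w + p_nli_w)


def to_db(ratio):
    """Convert a linear power ratio to decibels."""
    return 10.0 * math.log10(ratio)


def optimum_launch_power_w(p_ase_w, eta_per_w2):
    """Launch power that maximizes SNR: P_opt = (P_ASE / (2 * eta))^(1/3)."""
    return (p_ase_w / (2.0 * eta_per_w2)) ** (1.0 / 3.0)


# Illustrative values only (assumptions, not taken from the datasets).
P_ASE_W = 1e-6        # accumulated ASE noise power in watts
ETA_PER_W2 = 1000.0   # nonlinear coefficient in 1/W^2

p_opt = optimum_launch_power_w(P_ASE_W, ETA_PER_W2)
snr_opt_db = to_db(gn_snr_linear(p_opt, P_ASE_W, ETA_PER_W2))
```

At the optimum launch power the nonlinear interference equals half the ASE noise power, which is why both lower and higher launch powers reduce the SNR.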

Approach

An end-to-end data generation pipeline was designed and implemented, covering dynamic network simulation, service provisioning, and transformation into ML-ready dataset representations.

A central methodological element is the use of multi-categorical, feature-wise dataset visualizations, which enable qualitative assessment of dataset balance, coverage, and bias prior to model training. These visual analyses reveal how even minor changes in network assumptions can induce substantial shifts in data distributions, and these shifts directly affect model behavior and cross-scenario generalization.
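
The counting step behind a feature-wise, multi-categorical view can be sketched as follows: samples are binned along one feature and counted separately per category, which yields the data for a grouped histogram. This is a minimal sketch of the general idea, not the project's visualization code; the feature name, category values, and bin width are hypothetical.

```python
from collections import Counter, defaultdict


def featurewise_counts(samples, feature, category, bin_width):
    """Count samples per category and per feature bin.

    Returns a mapping: category value -> Counter({bin start: sample count}).
    """
    counts = defaultdict(Counter)
    for sample in samples:
        # Map the feature value to the lower edge of its bin.
        bin_start = int(sample[feature] // bin_width) * bin_width
        counts[sample[category]][bin_start] += 1
    return counts


# Made-up samples; "path_length_km" and "label" are illustrative names.
samples = [
    {"path_length_km": 300, "label": "good"},
    {"path_length_km": 450, "label": "good"},
    {"path_length_km": 1200, "label": "good"},
    {"path_length_km": 1800, "label": "bad"},
    {"path_length_km": 2600, "label": "bad"},
]
counts = featurewise_counts(samples, "path_length_km", "label", bin_width=1000)

# Print one row per category, with bins in ascending order.
for label in sorted(counts):
    print(label, sorted(counts[label].items()))
```

Bins that are empty for one category but populated for another point to coverage gaps, which is the kind of imbalance such a view is meant to expose before training.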

Together, these methods shift QoT model evaluation from a purely model-centric perspective toward a data-aware benchmarking process.

Outcome

The result is a public QoT benchmark dataset collection that has since been reused in multiple studies on network-wide QoT estimation, privacy-preserving learning, and cross-domain model transfer.

The project establishes that:

dataset choice alone can dominate reported ML performance,
models trained on narrowly distributed datasets fail to generalize across scenarios,
transparent dataset analysis is essential for interpretable and deployable optical-network ML.
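
The evaluation protocol implied by these findings, training on one dataset and testing on every dataset, can be sketched as a train/test matrix. The example below is a toy illustration with made-up data and a deliberately trivial one-feature threshold classifier; it does not reproduce the project's models, datasets, or results.

```python
def accuracy(threshold, samples):
    """Fraction of samples where 'length <= threshold' matches the label."""
    correct = sum((1 if x <= threshold else 0) == y for x, y in samples)
    return correct / len(samples)


def fit_threshold(samples):
    """Pick the observed path length that maximizes training accuracy."""
    candidates = sorted({x for x, _ in samples})
    return max(candidates, key=lambda t: accuracy(t, samples))


def cross_dataset_matrix(datasets):
    """Train on each dataset, then score the fitted model on every dataset."""
    matrix = {}
    for train_name, train in datasets.items():
        threshold = fit_threshold(train)
        for test_name, test in datasets.items():
            matrix[(train_name, test_name)] = accuracy(threshold, test)
    return matrix


# Toy data as (path_length_km, label) pairs. The two hypothetical scenarios
# differ only in the reach at which a lightpath stops being feasible.
datasets = {
    "A": [(500, 1), (1500, 1), (2000, 1), (2500, 0), (3000, 0)],
    "B": [(500, 1), (1000, 1), (1500, 0), (2000, 0), (2500, 0)],
}
matrix = cross_dataset_matrix(datasets)

for (train_name, test_name), acc in sorted(matrix.items()):
    print(f"train={train_name} test={test_name} accuracy={acc:.2f}")
```

In this toy setup the diagonal entries (same dataset for training and testing) are perfect, while the off-diagonal entries drop, mirroring in miniature how a shift in scenario assumptions can hide behind a strong in-distribution score.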

By combining open data, structured documentation, and visualization-driven analysis, this work enables reliable, comparable, and explainable ML research for optical network design and operation.

Machine learning (ML)-assisted solutions for quality of transmission (QoT) estimation or classification have received significant attention in recent years. However, because large and well-structured datasets are unavailable, individual research groups have to create and use their own datasets to validate their proposed solutions. As a result, the reported results, obtained on different datasets, are difficult to reproduce and hardly comparable. Beyond this limitation, there is no common technique that research groups can follow to explain their datasets, which makes it even harder to validate ML-assisted solutions across papers. In this work, we present a publicly available dataset collection to open the problem of data-driven QoT estimation to the ML community. The dataset collection allows solutions presented by different research groups to be compared. Furthermore, we present techniques to visualize and evaluate datasets for QoT estimation. The presented visualizations can also deliver deep insight into the error analysis of ML models. We apply these new methods to evaluate an artificial neural network on different datasets. The results show the relevance of the presented visualizations for comparing different approaches and different datasets. The proposed methods enable the comparison and validation of different ML-based solutions and published datasets.