Time Series Dataset for Modeling and Forecasting of $N_2O$ in Wastewater Treatment
Laura Debel Hansen, Anju Rani, Mikkel Algren Stokholm-Bjerregaard, Peter Alexander Stentoft, Daniel Ortiz Arroyo, Petar Durdevic
TL;DR
The paper presents a two-year, high-resolution $N_2O$ time series dataset from a full-scale wastewater treatment plant (WWTP), sampled every 2 minutes, to support data-driven modeling and forecasting of $N_2O$ in activated sludge processes. It details data collection via SCADA-connected cloud platforms, sensor hardware, airflow estimation, standardized variable naming, phase-code control inputs, and per-signal quality flags, and it captures real-world operational variability. The dataset serves as a benchmark for machine learning and deep learning time series forecasting under nonstationarity, seasonality, heteroscedasticity, and structural breaks, and comes with descriptive statistics and visualizations that characterize the data. Researchers and practitioners can use it to develop, compare, and validate mitigation and control strategies for $N_2O$ emissions in wastewater treatment, with practical implications for monitoring and reducing greenhouse gas footprints.
Abstract
In this paper, we present two years of high-resolution nitrous oxide ($N_2O$) measurements for time series modeling and forecasting in wastewater treatment plants (WWTPs). The dataset comprises real-time measurements from a full-scale WWTP at a sampling interval of 2 minutes, which makes it well suited to developing models for real-time operation and control. This biochemical dataset includes detailed influent and effluent parameters, operational conditions, and environmental factors. Unlike existing datasets, it addresses the specific challenges of modeling $N_2O$, a potent greenhouse gas, and gives researchers a resource for improving predictive accuracy and control strategies in wastewater treatment processes. The dataset also contributes to machine learning and deep learning time series forecasting by serving as a benchmark that mirrors the complexities of real-world processes. We provide a detailed description of the dataset along with a statistical analysis that highlights its characteristics, such as nonstationarity, nonnormality, seasonality, heteroscedasticity, structural breaks, asymmetric distributions, and intermittency, which are common in real-world time series and pose challenges for forecasting models.
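To make two of the listed properties concrete, the following is a minimal sketch of how asymmetry and heteroscedasticity might be checked on a 2-minute series that carries a quality flag. It runs on purely synthetic data, not on the published dataset. The function names, the boolean flag convention, the start date, and the signal shape are all assumptions made for illustration.

```python
import math
import random
import statistics
from datetime import datetime, timedelta

SAMPLE_INTERVAL = timedelta(minutes=2)  # sampling interval stated in the abstract
SAMPLES_PER_DAY = 24 * 60 // 2          # 720 samples per day at 2-minute resolution


def make_synthetic_series(days=14, seed=0):
    """Return (timestamps, values, flags) for a toy right-skewed signal.

    Purely synthetic and hypothetical: this is NOT the published dataset.
    """
    rng = random.Random(seed)
    start = datetime(2024, 1, 1)  # arbitrary start date (assumption)
    timestamps, values, flags = [], [], []
    for i in range(days * SAMPLES_PER_DAY):
        phase = 2 * math.pi * (i % SAMPLES_PER_DAY) / SAMPLES_PER_DAY
        daily = 1.0 + math.sin(phase)  # daily cycle between 0 and 2
        # Log-normal noise scaled by the cycle: skewed, with time-varying variance.
        values.append(daily * rng.lognormvariate(0.0, 0.8))
        flags.append(rng.random() > 0.02)  # hypothetical flag: about 2% marked bad
        timestamps.append(start + i * SAMPLE_INTERVAL)
    return timestamps, values, flags


def sample_skewness(xs):
    """Biased sample skewness (third standardized moment)."""
    mean = statistics.fmean(xs)
    sd = statistics.pstdev(xs)
    return sum(((x - mean) / sd) ** 3 for x in xs) / len(xs)


def windowed_variances(values, flags, window):
    """Population variance of flag-accepted samples in consecutive windows."""
    out = []
    for start in range(0, len(values), window):
        chunk = [v for v, ok in zip(values[start:start + window],
                                    flags[start:start + window]) if ok]
        if len(chunk) >= 2:  # skip windows with too few accepted samples
            out.append(statistics.pvariance(chunk))
    return out


timestamps, values, flags = make_synthetic_series()
clean = [v for v, ok in zip(values, flags) if ok]
variances = windowed_variances(values, flags, window=30)  # 30 samples = 1 hour
print(f"skewness of accepted samples: {sample_skewness(clean):.2f}")
print(f"hourly variance range: {min(variances):.4f} to {max(variances):.4f}")
```

A positive skewness indicates a right-skewed distribution, and a wide spread between the smallest and largest hourly variance indicates that the variance changes over time. Formal tests for stationarity or structural breaks would need a dedicated statistics library and are outside the scope of this sketch.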
