Analyzing Spatio-Temporal Dynamics of Dissolved Oxygen for the River Thames using Superstatistical Methods and Machine Learning

Hankun He; Takuya Boehringer; Benjamin Schäfer; Kate Heppell; Christian Beck

TL;DR

Regression analysis incorporating various water quality indicators and temporal features identifies the Light Gradient Boosting Machine as the best model for same-time prediction of dissolved oxygen, and reveals that temperature, pH, and time of year play crucial roles in the predictions.

Abstract

Using superstatistical methods and machine learning, we analyze time series data of water quality indicators for the River Thames, with a specific focus on the dynamics of dissolved oxygen. After detrending, the probability density functions of dissolved oxygen fluctuations exhibit heavy tails that are effectively modeled using $q$-Gaussian distributions. Our findings indicate that the multiplicative Empirical Mode Decomposition method is the most effective detrending technique, yielding the highest log-likelihood in nearly all fittings. We also observe that the optimally fitted width parameter of the $q$-Gaussian shows a negative correlation with the distance to the sea, highlighting the influence of geographical factors on water quality dynamics. In the context of same-time prediction of dissolved oxygen, regression analysis incorporating various water quality indicators and temporal features identifies the Light Gradient Boosting Machine as the best model. SHapley Additive exPlanations reveal that temperature, pH, and time of year play crucial roles in the predictions. Furthermore, we use the Transformer to forecast dissolved oxygen concentrations. For long-term forecasting, the Informer model consistently delivers superior performance, achieving the lowest MAE and SMAPE with the 192 historical time steps that we used. This performance is attributed to the Informer's ProbSparse self-attention mechanism, which allows it to capture long-range dependencies in time-series data more effectively than other machine learning models. It effectively recognizes the half-life cycle of dissolved oxygen, with particular attention to key intervals. Our findings provide valuable insights for policymakers involved in ecological health assessments, aiding in accurate predictions of river water quality and the maintenance of healthy aquatic ecosystems.
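
To make the log-likelihood comparison in the abstract concrete, here is a minimal sketch of a $q$-Gaussian density and the log-likelihood of a sample under it. It assumes the common parametrization $p(x) = \frac{\sqrt{\beta}}{C_q}\,[1 + (q-1)\beta x^2]^{-1/(q-1)}$ for $1 < q < 3$; the paper's own convention (for example a factor of $1/2$ in front of $\beta$) may differ. The function names are illustrative and do not come from the paper.

```python
import math


def q_gaussian_logpdf(x, q, beta):
    """Log-density of a q-Gaussian for 1 < q < 3 and beta > 0.

    Assumed form: p(x) = sqrt(beta) / C_q * [1 + (q - 1) * beta * x^2]^(-1/(q - 1)),
    with C_q = sqrt(pi) * Gamma((3 - q) / (2 (q - 1))) / (sqrt(q - 1) * Gamma(1 / (q - 1))).
    """
    if not (1.0 < q < 3.0) or beta <= 0.0:
        raise ValueError("this sketch only covers 1 < q < 3 and beta > 0")
    # Work with log-gamma so that q close to 1 does not overflow.
    log_c_q = (
        0.5 * math.log(math.pi)
        + math.lgamma((3.0 - q) / (2.0 * (q - 1.0)))
        - 0.5 * math.log(q - 1.0)
        - math.lgamma(1.0 / (q - 1.0))
    )
    log_kernel = -math.log1p((q - 1.0) * beta * x * x) / (q - 1.0)
    return 0.5 * math.log(beta) - log_c_q + log_kernel


def q_gaussian_pdf(x, q, beta):
    """Density corresponding to q_gaussian_logpdf."""
    return math.exp(q_gaussian_logpdf(x, q, beta))


def log_likelihood(samples, q, beta):
    """Sum of log-densities; higher values indicate a better fit."""
    return sum(q_gaussian_logpdf(x, q, beta) for x in samples)
```

Under this parametrization, $q = 2$ and $\beta = 1$ reduce to the standard Cauchy density, and $q \to 1$ recovers a Gaussian with variance $1/(2\beta)$. Comparing detrending methods then amounts to maximizing `log_likelihood` over $(q, \beta)$ for each detrended series and comparing the maxima.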

Paper Structure (19 sections, 13 equations, 8 figures, 2 tables)

Introduction
The data available
Detrending
Superstatistical analysis
Regression Analysis
Time series forecasting
Discussion
Conclusion
Methods
Acknowledgments
Author contributions
Competing interests
Additional information

Figures (8)

Figure 1: Top: A map of the nine available water quality monitoring sites (red markers) along the River Thames. The geographical visualisations were generated using the Folium Python library with map data sourced from OpenStreetMap. We show the trajectory (a) and PDF (b) of the DO concentration at TCaP over a time span of five years. We apply additive seasonal (c), additive EMD (d), multiplicative seasonal (e) and multiplicative EMD (f) detrending methods. A filtering frequency of $f = 6$ hours or dropping $m = 3$ modes (orange) captures the oscillating trend while preserving the short-term fluctuations.
Figure 2: Scatter plot showing the relationship between log-likelihood values for different detrending methods and the distance to sea for various sites. The log-likelihood values are evaluated based on the best fit to a $q$-Gaussian distribution. The graph demonstrates the superiority of the multiplicative methods over the additive ones, with the multiplicative EMD method standing out as the most effective approach in nearly all cases.
Figure 3: Top: PDFs of oxygen fluctuations obtained via additive seasonal (a), additive EMD (b), multiplicative seasonal (c) and multiplicative EMD (d) detrending methods, for the example of the site TCaP. Regardless of the methods used, detrending leads to non-Gaussian distributions, which can be approximated by $q$-Gaussian distributions (purple). Bottom: For each site along the River Thames, we plot the scale parameter $\beta$ and the shape parameter $q$ from the $q$-Gaussian fitting against its distance to the Thames Estuary/sea.
Figure 4: TPut (top) and TKB (bottom) data plotted against the same-time predictions by LGBM, all in chronological order. The SHAP values are plotted on the right-hand side for each site, showing the calculated feature importances.
Figure 5: Top: Example windows for the Informer's forecast of $t \in \{12, 48\}$ with input lengths of $48$ and $192$ time steps, respectively. The blue line represents the DO input at every time step. While the model processes all features, this visualization only displays the DO. The green dots depict the desired prediction values. The purple crosses represent the forecast of DO made by the Informer model in a single shot. Bottom: A visualization of the attention weights, represented through a heatmap, when predicting 48 future time steps from 192 past steps. The $x$-axis and $y$-axis correspond to the keys and queries, respectively, associated with each element in the sequence. Lighter colors correspond to higher weights.
...and 3 more figures
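
The difference between the additive and multiplicative detrending named in the Figure 1 caption can be sketched as follows. This is a simplified illustration only: it uses a centered moving average as a stand-in for the trend, whereas the paper obtains the trend from seasonal decomposition or Empirical Mode Decomposition. The function names and the `window` parameter are hypothetical, and the paper may normalize the multiplicative residual differently.

```python
def moving_average(series, window):
    """Centered moving average; the window shrinks near the edges."""
    half = window // 2
    trend = []
    for i in range(len(series)):
        lo = max(0, i - half)
        hi = min(len(series), i + half + 1)
        trend.append(sum(series[lo:hi]) / (hi - lo))
    return trend


def detrend(series, window, mode="multiplicative"):
    """Remove a slowly varying trend from a series.

    additive:       fluctuation = value - trend  (scatters around 0)
    multiplicative: fluctuation = value / trend  (scatters around 1)
    The multiplicative form assumes a strictly positive trend, which is
    plausible for a concentration such as dissolved oxygen.
    """
    trend = moving_average(series, window)
    if mode == "additive":
        return [x - t for x, t in zip(series, trend)]
    if mode == "multiplicative":
        return [x / t for x, t in zip(series, trend)]
    raise ValueError("mode must be 'additive' or 'multiplicative'")
```

For example, `detrend(do_values, window=25, mode="additive")` returns residuals in the same units as the input, while the multiplicative mode returns dimensionless ratios, so the fluctuation amplitude scales with the local level of the series.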
