Hybrid machine learning data assimilation for marine biogeochemistry

Ieuan Higgs, Ross Bannister, Jozef Skákala, Alberto Carrassi, Stefano Ciavatta

TL;DR

Marine biogeochemistry forecasting is limited by multivariate data assimilation challenges under sparse observations and computational constraints. The authors introduce two hybrid ML-DA approaches, ML-OI and ML-EtE, embedded in a 1D GOTM-FABM-ERSEM framework to learn flow-dependent correlations or end-to-end analysis increments for unobserved variables. They demonstrate that ML-augmented schemes significantly improve updates beyond total chlorophyll, with partial transferability to a new location and actionable pathways toward 3D scalability. The work provides a practical, computationally efficient route to enhance marine BGC forecasts and reanalyses, while identifying key research priorities in training data sampling, transferability, and larger-scale assimilation. Overall, the study shows that integrating neural-network-based correlation learning and end-to-end DA can overcome current bottlenecks in multivariate marine BGC data assimilation, enabling more accurate and scalable forecasts.

Abstract

Marine biogeochemistry models are critical for forecasting and for estimating ecosystem responses to climate change and human activities. Data assimilation (DA) improves these models by aligning them with real-world observations, but marine biogeochemistry DA faces challenges due to model complexity, strong nonlinearity, and sparse, uncertain observations. Existing DA methods applied to marine biogeochemistry struggle to update unobserved variables effectively, while ensemble-based methods are computationally too expensive for high-complexity marine biogeochemistry models. This study demonstrates how machine learning (ML) can improve marine biogeochemistry DA by learning statistical relationships between observed and unobserved variables. We integrate ML-driven balancing schemes into a 1D prototype of a system used to forecast marine biogeochemistry in the North-West European Shelf seas. ML is applied to predict (i) state-dependent correlations from free-run ensembles and (ii), in an "end-to-end" fashion, analysis increments from an Ensemble Kalman Filter. Our results show that ML significantly enhances updates for variables that were previously not updated, when compared to univariate schemes akin to those used operationally. Furthermore, ML models exhibit moderate transferability to new locations, a crucial step toward scaling these methods to 3D operational systems. We conclude that ML offers a clear pathway to overcome current computational bottlenecks in marine biogeochemistry DA and that refining transferability, optimizing training data sampling, and evaluating scalability for large-scale marine forecasting should be future research priorities.
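To make concrete how a correlation between an observed and an unobserved variable can carry an update from one to the other, the following is a minimal sketch of the textbook scalar case: the observed variable (here total chlorophyll) receives a Kalman-type increment, and that increment is regressed onto the unobserved variable (here nitrate) using the correlation and the ratio of background standard deviations. This is an illustration under stated assumptions, not the authors' implementation: the function names, the toy ensemble values, and the choice to estimate the statistics from a small ensemble are all hypothetical, and in the paper's hybrid schemes the correlation would instead be supplied by a trained ML model.

```python
import math


def sample_stats(xs, ys):
    """Return (std_x, std_y, pearson_r) using the unbiased (n - 1) estimator."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long samples with at least 2 members")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    var_x = sum((x - mean_x) ** 2 for x in xs) / (n - 1)
    var_y = sum((y - mean_y) ** 2 for y in ys) / (n - 1)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / (n - 1)
    std_x, std_y = math.sqrt(var_x), math.sqrt(var_y)
    return std_x, std_y, cov_xy / (std_x * std_y)


def observed_increment(background, observation, bg_std, obs_std):
    """Scalar Kalman-type increment for the directly observed variable."""
    gain = bg_std ** 2 / (bg_std ** 2 + obs_std ** 2)
    return gain * (observation - background)


def balanced_increment(obs_increment, corr, std_observed, std_unobserved):
    """Regress the observed-variable increment onto an unobserved variable.

    corr may come from an ensemble, a climatology, or (hypothetically) an ML model.
    """
    return corr * (std_unobserved / std_observed) * obs_increment


# Toy 4-member ensemble (made-up numbers): chlorophyll and nitrate anti-correlated.
chl_ens = [1.0, 2.0, 3.0, 4.0]
nit_ens = [8.0, 6.0, 4.0, 2.0]

std_chl, std_nit, rho = sample_stats(chl_ens, nit_ens)
d_chl = observed_increment(background=2.5, observation=3.5,
                           bg_std=std_chl, obs_std=std_chl)
d_nit = balanced_increment(d_chl, rho, std_chl, std_nit)
print(f"rho={rho:.3f}, chlorophyll increment={d_chl:.3f}, nitrate increment={d_nit:.3f}")
```

In this toy setting a positive chlorophyll increment produces a negative nitrate increment because the two are anti-correlated; a fixed climatological correlation and a state-dependent one would differ only in the value passed as `corr`.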

Paper Structure

This paper contains 25 sections, 10 equations, 13 figures, 2 tables.

Figures (13)

  • Figure 1: Map of the Western English Channel, marking the L4 model-training location with a black cross and the CWEC (Central Western English Channel) with a red cross, where we evaluated the model portability.
  • Figure 2: The top panel shows a time series of surface concentrations of total chlorophyll (black) and nitrate (green) for an arbitrary year at the L4 location. The bottom panel shows the climatological correlation between total chlorophyll and nitrate, calculated across the 2000-2014 training period. Shading indicates the dominant seasonal system regimes: "light-limited" (white), "bloom" (light grey) and "nutrient-limited" (dark grey).
  • Figure 3: Predictions of the correlation between total chlorophyll and nitrate, at weekly intervals across the 3-year offline test period for the ML-OI approach. The "true" correlation (black) is calculated from the 100-member free-run ensemble (Table \ref{tab:run_types}, row 2). The predictions by ML-OI (blue) are shown, with an RMS difference from the true correlation of 0.255. A daily climatology of correlations has also been calculated from the training data (red), with an RMS difference of 0.731. The seasonal regimes of Figure \ref{fig:regime_highlight} are repeated.
  • Figure 4: The relationship between analysis RMSE (Eq. \ref{eq:expected_error}) and ensemble size for EnKFs with different ensemble sizes, as well as the performance of the different single-model run schemes. The left panel shows the RMSE of the observed variable, total chlorophyll, normalised relative to the observational error. The right panel shows the RMSE of the unobserved variable, nitrate. The black dashed line represents the mean expected ensemble member error from 20 repeat experiments of an EnKF at increasing ensemble sizes, with the shaded grey area indicating $\pm 1$ standard deviation. The mean error and $\pm 1$ standard deviation of 64 independent single-model runs are also given for each of the methods summarised in Sects. \ref{sec:da_setups} and \ref{sec:new_schemes}.
  • Figure 5: A comparison of nitrate analysis increments produced in a single-model run in "online" cycled-DA. The first panel shows the analysis increments made during the climatological correlations CliC run (solid red) and the difference between the background state and the truth (dashed red). The second panel shows the analysis increments made by the ML-predicted correlations ML-OI run (solid blue) and the difference between the background state and the truth (dashed blue). The third panel shows the analysis increments predicted directly by the ML-EtE run (solid green) and the difference between the background state and the truth (dashed green). In each panel the $R$-value represents the correlation between the analysis increments and the difference between the background and the truth. Shading indicates the system regimes previously outlined in Sect. \ref{sec:sys_dynamics}: "light-limited" (white), "bloom" (light grey) and "nutrient-limited" (dark grey).
  • ...and 8 more figures
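The captions above rely on two simple diagnostics: an RMS difference between a predicted and a reference series (Figure 3) and an $R$-value, read here as a Pearson correlation, between analysis increments and the background-truth difference (Figure 5). The sketch below shows how such diagnostics are commonly computed. It is only an illustration: the function names and the toy numbers are hypothetical, and the sign convention of the background-truth difference is an assumption, since the captions do not state it.

```python
import math


def rms_difference(predicted, reference):
    """Root-mean-square difference between two equally long series."""
    if len(predicted) != len(reference) or not predicted:
        raise ValueError("series must be non-empty and of equal length")
    squared = [(p - r) ** 2 for p, r in zip(predicted, reference)]
    return math.sqrt(sum(squared) / len(squared))


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equally long series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equally long series with at least 2 values")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    norm_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    norm_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (norm_x * norm_y)


# Toy values (made up): weekly correlations and nitrate increments.
true_corr = [-0.8, -0.2, 0.5, 0.1]
predicted_corr = [-0.6, -0.3, 0.4, 0.3]
print("RMS difference:", rms_difference(predicted_corr, true_corr))

# Assumed convention: target = truth - background, so a good increment has R near 1.
increments = [0.4, -0.1, 0.3, -0.5]
truth_minus_background = [0.5, -0.2, 0.2, -0.6]
print("R:", pearson_r(increments, truth_minus_background))
```

With this convention, an $R$-value close to 1 means the increments move the background toward the truth; with the opposite sign convention the same behaviour would appear as an $R$-value close to -1.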