
Harmful algal bloom forecasting. A comparison between stream and batch learning

Andres Molares-Ulloa, Elisabet Rocruz, Daniel Rivero, Xosé A. Padin, Rita Nolasco, Jesús Dubert, Enrique Fernandez-Blanco

TL;DR

This work tackles harmful algal bloom forecasting for Dinophysis acuminata by directly comparing Stream Learning and Batch Learning across seven algorithms using daily CROCO-derived environmental data. The authors demonstrate that DoME, a symbolic regression approach, achieves the best average $R^2$, about $0.77$, for a 3-day-ahead forecast across six stations, while also offering interpretable predictive equations. The study shows that PCA-based feature reduction has mixed effects depending on the model and location, and that, within the data period analyzed (2013–2019), Stream Learning does not outperform Batch Learning. Overall, leveraging CROCO outputs enables daily HAB predictions in data-scarce oceanographic settings and highlights DoME’s practical value for aquaculture management.

Abstract

Diarrhetic Shellfish Poisoning (DSP) is a global health threat arising from shellfish contaminated with toxins produced by dinoflagellates. The condition, with its widespread incidence, high morbidity rate, and persistent shellfish toxicity, poses risks to public health and the shellfish industry. High-biomass events of toxin-producing algae, such as those responsible for DSP, are known as Harmful Algal Blooms (HABs). Monitoring and forecasting systems are crucial for mitigating the impact of HABs. Predicting harmful algal blooms is a time-series problem with a strong historical seasonal component; however, recent anomalies due to changes in meteorological and oceanographic events have been observed. Stream Learning stands out as one of the most promising approaches for addressing time-series problems with concept drift. However, its efficacy in predicting HABs remains unproven and needs to be tested against Batch Learning. Historical data availability is a critical point in developing predictive systems. In oceanography, data collection is subject to constraints and limitations, which has led to the exploration of new tools for obtaining more exhaustive time series. In this study, a machine learning workflow for predicting the number of cells of a toxic dinoflagellate, Dinophysis acuminata, was developed with several key advancements. Seven machine learning algorithms were compared within two learning paradigms. Notably, output from the ocean hydrodynamic model CROCO was used as the primary dataset, mitigating the scarcity of time-continuous historical data. This study highlights the value of model interpretability, a fair model comparison methodology, and the incorporation of Stream Learning models. The model DoME, with an average $R^2$ of 0.77 in the 3-day-ahead prediction, emerged as the most effective and interpretable predictor, outperforming the other algorithms.
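
The difference between the two learning paradigms compared in the abstract can be illustrated with a minimal sketch. It is not the authors' workflow: the data are synthetic, the model is a plain one-feature linear regressor rather than any of the seven algorithms studied, and every name (`make_pairs`, `batch_predictions`, `stream_predictions`, the learning rate `lr`) is hypothetical. The stream learner also updates on each label immediately, which ignores the fact that in a real 3-day-ahead setting the label only arrives three days later.

```python
import math
import random


def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def make_pairs(feature, target, horizon):
    """Pair the feature at day t with the target at day t + horizon."""
    return list(zip(feature[:-horizon], target[horizon:]))


def fit_line(pairs):
    """Ordinary least squares for y = w * x + b."""
    n = len(pairs)
    mx = sum(x for x, _ in pairs) / n
    my = sum(y for _, y in pairs) / n
    sxx = sum((x - mx) ** 2 for x, _ in pairs)
    sxy = sum((x - mx) * (y - my) for x, y in pairs)
    w = sxy / sxx
    return w, my - w * mx


def batch_predictions(train, test):
    """Batch Learning: fit once on the training block, then freeze the model."""
    w, b = fit_line(train)
    return [w * x + b for x, _ in test]


def stream_predictions(train, test, lr=0.05):
    """Stream Learning: predict each sample first, then update on it."""
    w, b = fit_line(train)  # warm start on the same history as the batch model
    preds = []
    for x, y in test:
        y_hat = w * x + b
        preds.append(y_hat)
        err = y_hat - y  # simplification: label assumed available immediately
        w -= lr * err * x
        b -= lr * err
    return preds


# Synthetic daily series (an assumption, not CROCO data): the target at
# day t + 3 depends linearly on the feature at day t, plus noise.
random.seed(0)
HORIZON = 3
feature = [math.sin(2 * math.pi * d / 30) for d in range(400)]
target = [1.0] * HORIZON + [
    2.0 * f + 1.0 + random.gauss(0, 0.1) for f in feature[:-HORIZON]
]

pairs = make_pairs(feature, target, HORIZON)
split = int(0.7 * len(pairs))  # chronological split, no shuffling
train, test = pairs[:split], pairs[split:]
y_test = [y for _, y in test]

r2_batch = r2_score(y_test, batch_predictions(train, test))
r2_stream = r2_score(y_test, stream_predictions(train, test))
print(f"batch R2 = {r2_batch:.3f}, stream R2 = {r2_stream:.3f}")
```

Both learners are scored on the same chronological test block with the same metric, which is the minimum needed for the comparison to be fair. On stationary synthetic data like this, the two scores should be close; the stream learner would only be expected to gain ground if the input-output relationship drifted during the test period.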

Paper Structure (20 sections, 4 equations, 5 figures, 6 tables)


Figures (5)

  • Figure 1: Map of the Galician Rias Baixas of Arousa, Pontevedra and Vigo showing the INTECMAR stations (coloured dots within the Rias), the Upwelling Index station (blue dot on the shelf) and the sections (black lines). Red dots mark the stations where predictions were made; orange dots mark the stations whose data were used as prediction inputs. In each section, the first half runs from the black square to the central cross-line and the second half from the cross-line to the end.
  • Figure 2: Schematic representation of the machine learning-based system proposed for HAB forecasting.
  • Figure 3: Heat map showing the $R^2$ score obtained by the best model for each predicted station, for prediction horizons of 1 to 7 days. The map also shows which model obtained each result, the learning paradigm used, and whether it was obtained with or without PCA. Outer stations are indicated with an asterisk.
  • Figure 4: Box plot showing the results obtained by applying the models to predict 3 days ahead at six monitoring stations.
  • Figure 5: Observed D. acuminata concentrations and the concentrations predicted by DoME and by HTR/HATR (HTR at A8, P2 and V4; HATR at A3, P4 and V1), the best-performing models for BL and SL respectively. Predictions are made 3 days in advance over the test period.