From Remote Sensing to Multiple Time Horizons Forecasts: Transformers Model for CyanoHAB Intensity in Lake Champlain

Muhammad Adil, Patrick J. Clemins, Andrew W. Schroth, Panagiotis D. Oikonomou, Donna M. Rizzo, Peter D. F. Isles, Xiaohan Zhang, Kareem I. Hannoun, Scott Turnbull, Noah B. Beckage, Asim Zia, Safwan Wshah

TL;DR

This study forecasts cyanobacterial bloom intensity in Lake Champlain using a remote-sensing-only approach. It introduces a Transformer-BiLSTM architecture with TS-FAN that predicts five intensity classes across a 14-day horizon from Cyanobacterial Index (CI) data from the Cyanobacterial Assessment Network (CyAN) and MODIS temperature data, and it addresses extreme data sparsity with two-stage imputation and seasonality-aware augmentation. The model outperforms traditional ML and other deep-learning baselines in both short- and long-range forecasts and generalizes to novel bloom patterns and inter-annual variability. The framework is designed to transfer to other US freshwater systems and is supported by a publicly available data-processing pipeline, making it a practical tool for early warning and water-resource management.

Abstract

Cyanobacterial Harmful Algal Blooms (CyanoHABs) pose significant threats to aquatic ecosystems and public health globally. Lake Champlain is particularly vulnerable to recurring CyanoHAB events, especially in its northern segments (Missisquoi Bay, St. Albans Bay, and the Northeast Arm), due to nutrient enrichment and climatic variability. Remote sensing provides a scalable solution for monitoring and forecasting these events, offering continuous coverage where in situ observations are sparse or unavailable. In this study, we present a remote-sensing-only forecasting framework that combines a Transformer and a BiLSTM to predict CyanoHAB intensities up to 14 days in advance. The system uses Cyanobacterial Index data from the Cyanobacterial Assessment Network and temperature data from Moderate Resolution Imaging Spectroradiometer (MODIS) satellites to capture long-range dependencies and sequential dynamics in satellite time series. The dataset is very sparse: more than 30% of the Cyanobacterial Index data and 90% of the temperature data are missing. A two-stage preprocessing pipeline addresses these gaps by applying forward fill and weighted temporal imputation at the pixel level, followed by smoothing to reduce discontinuities in CyanoHAB events. The raw dataset is transformed into features through equal-frequency binning of the Cyanobacterial Index values and extraction of temperature statistics. The Transformer-BiLSTM model demonstrates strong forecasting performance across multiple horizons, achieving F1 scores of 89.5%, 86.4%, and 85.5% at one-, two-, and three-day forecasts, respectively, and maintaining an F1 score of 78.9% with an AUC of 82.6% at the 14-day horizon. These results confirm the model's ability to capture complex spatiotemporal dynamics from sparse satellite data and to provide reliable early warning for CyanoHAB management.
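
The equal-frequency binning mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the nearest-rank quantile rule, the integer labels 0 to 4 (the source names only the end classes, Low and Extreme), and the synthetic CI values are all assumptions made for illustration.

```python
from bisect import bisect_right


def equal_frequency_edges(values, n_bins=5):
    """Return the n_bins - 1 interior cut points that split `values`
    into bins holding roughly the same number of samples."""
    ordered = sorted(values)
    n = len(ordered)
    if n < n_bins:
        raise ValueError("need at least n_bins values")
    # Cut point k sits at the k/n_bins quantile (nearest rank, no interpolation).
    return [ordered[(k * n) // n_bins] for k in range(1, n_bins)]


def assign_class(value, edges):
    """Map a value to a class index: 0 (lowest bin) .. len(edges) (highest bin)."""
    return bisect_right(edges, value)


# Synthetic, right-skewed values standing in for Cyanobacterial Index data
# (hypothetical numbers, not taken from the paper).
ci_values = [0.0001 * (i + 1) ** 2 for i in range(100)]

edges = equal_frequency_edges(ci_values, n_bins=5)
labels = [assign_class(v, edges) for v in ci_values]
counts = [labels.count(c) for c in range(5)]
print(counts)
```

Because the cut points come from the data's own quantiles, each class receives about the same number of samples even when the raw values are heavily skewed. Equal-width bins would instead place most samples in the lowest class.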

Paper Structure

This paper contains 37 sections, 20 equations, 15 figures, 1 table.

Figures (15)

  • Figure 1: Map of Lake Champlain showing its 12 segments, with a bar chart on the right side of each segment's CyanoHAB events from 2016 to 2023. Missisquoi Bay and St. Albans Bay experienced a large number of CyanoHAB events during that period. The right-side map shows the three stations (Missisquoi Bay, St. Albans Bay, and Northeast Arm) used in the study.
  • Figure 2: Complete data processing pipeline for CyanoHABs prediction, from data acquisition through class balancing. The workflow consists of four stages: data downloading (CI and temperature), data sparsity management through imputation, data extraction for feature engineering, and class balancing using temporal sampling strategies.
  • Figure 3: Visualization of data sparsity. "B." denotes before and "A." denotes after. (a) shows the percentage of missing Cyanobacterial Index values at each imputation stage, and (b) shows the percentage of missing temperature data. The green bars show that around 30% of the original Cyanobacterial Index values and around 90% of the temperature data are missing. The orange bars show the dataset after Forward Fill, which adds a significant number of data points. The purple bars show the dataset after Weighted Temporal Window Imputation, which further reduces the missing percentage. We constrained further imputation to avoid adding too much noise.
  • Figure 4: The Transformer-BiLSTM model for CyanoHAB intensity forecasting. The system processes 15 days of remote sensing inputs through an embedding layer, positional encoding, a Transformer encoder with multi-head self-attention, and a bidirectional LSTM block. The final output layer predicts five CyanoHAB intensity classes (Low to Extreme) across a 14-day forecast horizon. The pipeline incorporates preprocessing steps to address data sparsity and ensure class balance.
  • Figure 5: F1 score trends across the 14-day forecast horizon for all individual models and the Transformer-BiLSTM model, for all segments. Forecast performance generally declines with increasing lead time. The Transformer-BiLSTM outperforms the individual models at longer forecast ranges and performs competitively for short- and medium-range forecasts. Vertical dashed lines demarcate short-range (Days 1–4), medium-range (Days 5–9), and long-range (Days 10–14) forecasts. Circular markers denote discrete evaluation points at each forecast day.
  • ...and 10 more figures
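
The two imputation stages named in the Figure 3 caption (Forward Fill, then Weighted Temporal Window Imputation) can be sketched for a single pixel's daily series as follows. This is an illustration under stated assumptions, not the paper's method: the gap limit `max_gap`, the window `half_width`, the inverse-distance weights, and the sample series are all hypothetical, since the source does not give these details.

```python
def forward_fill(series, max_gap=1):
    """Stage 1: carry the last valid value forward over at most
    `max_gap` consecutive missing days (None marks a missing day)."""
    filled = list(series)
    last, run = None, 0
    for i, value in enumerate(series):
        if value is not None:
            last, run = value, 0
        else:
            run += 1
            if last is not None and run <= max_gap:
                filled[i] = last
    return filled


def weighted_window_impute(series, half_width=2):
    """Stage 2: fill each remaining gap with a mean of the valid values
    within `half_width` days, weighted by 1 / distance in days (assumed)."""
    out = list(series)
    for i, value in enumerate(series):
        if value is not None:
            continue
        num = den = 0.0
        lo, hi = max(0, i - half_width), min(len(series), i + half_width + 1)
        for j in range(lo, hi):
            if series[j] is not None:  # j != i here, so abs(i - j) >= 1
                weight = 1.0 / abs(i - j)
                num += weight * series[j]
                den += weight
        if den > 0:
            out[i] = num / den  # gaps with no valid neighbour stay missing
    return out


def missing_pct(series):
    """Percentage of missing entries in a series."""
    return 100.0 * sum(v is None for v in series) / len(series)


# Hypothetical ten-day series for one pixel.
raw = [2.0, None, None, None, 6.0, None, None, None, None, None]
stage1 = forward_fill(raw, max_gap=1)
stage2 = weighted_window_impute(stage1, half_width=2)
print(missing_pct(raw), missing_pct(stage1), missing_pct(stage2))
```

Both stages are bounded, so long gaps remain missing rather than being filled with values far from any observation. This mirrors the caption's point that further imputation was constrained to avoid adding too much noise.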