Transformers vs. Recurrent Models for Estimating Forest Gross Primary Production

David Montero, Miguel D. Mahecha, Francesco Martinuzzi, César Aybar, Anne Klosterhalfen, Alexander Knohl, Jesús Anaya, Clemens Mosig, Sebastian Wieneke

TL;DR

Predicting daily forest GPP remains challenging due to sparse flux data and climate extremes. The study systematically contrasts a Transformer (GPT-2) and an LSTM for multimodal, satellite-based GPP estimation, using Sentinel-2, Sentinel-1, MODIS LST, and a physically based solar radiation proxy $R_{so}$ over 2016–2020. Both models reliably reproduce seasonal GPP. GPT-2 performs better during extreme drought and heat events, at the cost of longer context requirements and greater data demand; LSTM offers strong accuracy with substantially shorter context windows. Across modalities, $R_{so}$ emerges as the dominant predictor, with Sentinel-2 signals and thermal information contributing meaningfully, while Sentinel-1 adds limited value. These findings guide future DL framework design for monitoring terrestrial carbon dynamics and resilience under climate stress.

Abstract

Monitoring the spatiotemporal dynamics of forest CO$_2$ uptake (Gross Primary Production, GPP) remains a central challenge in terrestrial ecosystem research. While Eddy Covariance (EC) towers provide high-frequency estimates, their limited spatial coverage constrains large-scale assessments. Remote sensing offers a scalable alternative, yet most approaches rely on single-sensor spectral indices and statistical models that are often unable to capture the complex temporal dynamics of GPP. Recent advances in deep learning (DL) and data fusion offer new opportunities to better represent the temporal dynamics of vegetation processes, but comparative evaluations of state-of-the-art DL models for multimodal GPP prediction remain scarce. Here, we explore the performance of two representative models for predicting GPP from multivariate inputs: 1) GPT-2, a transformer architecture, and 2) Long Short-Term Memory (LSTM), a recurrent neural network. Both achieve similar accuracy: LSTM performs better overall, while GPT-2 excels during extreme events. Analysis of temporal context length further reveals that LSTM attains similar accuracy using substantially shorter input windows than GPT-2, highlighting an accuracy-efficiency trade-off between the two architectures. Feature importance analysis reveals radiation as the dominant predictor, followed by Sentinel-2, MODIS land surface temperature, and Sentinel-1 contributions. Our results demonstrate how model architecture, context length, and multimodal inputs jointly determine performance in GPP prediction, guiding future developments of DL frameworks for monitoring terrestrial carbon dynamics.

Paper Structure

This paper contains 22 sections, 4 figures.

Figures (4)

  • Figure 1: Normalized Root Mean Squared Error (NRMSE) for daily GPP predictions across four conditions: all timesteps (GPP$^{\textrm{Overall}}$), growing season (GPP$^{\textrm{Growing}}$), climate-induced extremes (GPP$^{-}$), and peaks of photosynthetic activity (GPP$^{+}$). Bars represent the median NRMSE, boxes the interquartile range (Q1-Q3), and whiskers span the 5th to 95th percentile across sites.
  • Figure 2: Permutation-based memory retention analysis across the 120-day context window. The y-axis shows the resulting NRMSE after prediction on the permuted inputs. Lines represent the median NRMSE across sites, while shaded areas show the interquartile range (darker) and the 5th-95th percentile range (lighter).
  • Figure 3: Minimum NRMSE achieved by each model for each GPP condition, plotted against the $t-\tau$ value in the 120-day context window where it was observed. Each point represents the condition-specific minimum NRMSE across the permuted sequence for either GPT-2 or LSTM.
  • Figure 4: Feature importance (FI) values for each input modality. Bars represent the median FI across sites, with thick and thin error bars indicating the interquartile and 5th-95th percentile ranges, respectively.
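
The captions above rely on two quantities: NRMSE and a per-modality feature importance. The sketch below illustrates how such quantities could be computed. It is not the authors' implementation, and it rests on assumptions not stated in this summary: the RMSE is normalised by the observed range, and importance is measured as the NRMSE increase after shuffling one modality's columns across samples. The names `nrmse`, `permutation_importance`, `predict`, and `groups`, as well as the synthetic tabular data, are hypothetical.

```python
import numpy as np


def nrmse(y_true, y_pred):
    """RMSE divided by the observed range (normalisation choice is an assumption)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))
    return rmse / (y_true.max() - y_true.min())


def permutation_importance(predict, X, y, groups, n_repeats=10, seed=0):
    """Mean NRMSE increase when one modality's columns are shuffled across samples.

    `groups` maps a modality name to the list of column indices it occupies in X.
    """
    rng = np.random.default_rng(seed)
    baseline = nrmse(y, predict(X))
    scores = {}
    for name, cols in groups.items():
        deltas = []
        for _ in range(n_repeats):
            X_perm = X.copy()
            order = rng.permutation(X.shape[0])
            # Shuffle rows jointly for all columns of this modality.
            X_perm[:, cols] = X[order][:, cols]
            deltas.append(nrmse(y, predict(X_perm)) - baseline)
        scores[name] = float(np.mean(deltas))
    return scores


# Synthetic stand-in for a trained model: the target depends strongly on
# column 0, weakly on column 1, and not at all on column 2.
rng = np.random.default_rng(42)
X = rng.normal(size=(500, 3))
weights = np.array([3.0, 0.5, 0.0])


def predict(features):
    return features @ weights


y = predict(X)
groups = {"radiation": [0], "optical": [1], "radar": [2]}
scores = permutation_importance(predict, X, y, groups)
for name, value in scores.items():
    print(f"{name}: {value:.4f}")
```

In this toy setting, the modality whose columns the stand-in model ignores receives an importance of zero, because shuffling it leaves the predictions unchanged. A sequence model would need the shuffling applied along the sample axis of a three-dimensional input rather than a two-dimensional table.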