Transformers vs. Recurrent Models for Estimating Forest Gross Primary Production
David Montero, Miguel D. Mahecha, Francesco Martinuzzi, César Aybar, Anne Klosterhalfen, Alexander Knohl, Jesús Anaya, Clemens Mosig, Sebastian Wieneke
TL;DR
Daily forest GPP prediction remains challenging due to sparse flux data and climate extremes. The study systematically contrasts a Transformer (GPT-2) and an LSTM for multimodal, satellite-based GPP estimation, using Sentinel-2, Sentinel-1, MODIS LST, and a physically based solar radiation proxy $R_{so}$ over 2016–2020. Both models reliably reproduce seasonal GPP, with GPT-2 providing superior performance during extreme drought/heat events, at the cost of longer context requirements and greater data demand; LSTM offers strong accuracy with substantially shorter context windows. Across modalities, $R_{so}$ emerges as the dominant predictor, with Sentinel-2 signals and thermal information contributing meaningfully, while Sentinel-1 adds limited value. These findings guide future deep learning (DL) framework design for monitoring terrestrial carbon dynamics and resilience under climate stress.
Abstract
Monitoring the spatiotemporal dynamics of forest CO$_2$ uptake (Gross Primary Production, GPP) remains a central challenge in terrestrial ecosystem research. While Eddy Covariance (EC) towers provide high-frequency estimates, their limited spatial coverage constrains large-scale assessments. Remote sensing offers a scalable alternative, yet most approaches rely on single-sensor spectral indices and statistical models that are often unable to capture the complex temporal dynamics of GPP. Recent advances in deep learning (DL) and data fusion offer new opportunities to better represent the temporal dynamics of vegetation processes, but comparative evaluations of state-of-the-art DL models for multimodal GPP prediction remain scarce. Here, we explore the performance of two representative models for predicting GPP from multivariate inputs: 1) GPT-2, a transformer architecture, and 2) Long Short-Term Memory (LSTM), a recurrent neural network. Both achieve similar accuracy; LSTM performs better overall, whereas GPT-2 excels during extreme events. Analysis of temporal context length further reveals that LSTM attains similar accuracy using substantially shorter input windows than GPT-2, highlighting an accuracy-efficiency trade-off between the two architectures. Feature importance analysis reveals radiation as the dominant predictor, followed by Sentinel-2, MODIS land surface temperature, and Sentinel-1 contributions. Our results demonstrate how model architecture, context length, and multimodal inputs jointly determine performance in GPP prediction, guiding future developments of DL frameworks for monitoring terrestrial carbon dynamics.
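To make the notion of temporal context length concrete, the following minimal sketch shows one common way to cut a daily multivariate series into fixed-length input windows for a sequence model. It is an illustration under stated assumptions, not the authors' pipeline: the function name `make_context_windows`, the choice to include the prediction day in the window, the four-feature layout, and the synthetic data are all hypothetical.

```python
import numpy as np


def make_context_windows(features, target, context_length):
    """Pair each day's target with the `context_length` days of features ending on that day.

    features: array of shape (n_days, n_features)
    target:   array of shape (n_days,)
    Returns X with shape (n_samples, context_length, n_features) and
    y with shape (n_samples,), where n_samples = n_days - context_length + 1.
    """
    features = np.asarray(features, dtype=float)
    target = np.asarray(target, dtype=float)
    n_days = features.shape[0]
    if target.shape[0] != n_days:
        raise ValueError("features and target must cover the same days")
    if not 1 <= context_length <= n_days:
        raise ValueError("context_length must be between 1 and n_days")

    n_samples = n_days - context_length + 1
    # Window i covers days i .. i + context_length - 1 and is paired with
    # the target on its last day (an assumption made for this sketch).
    X = np.stack([features[i:i + context_length] for i in range(n_samples)])
    y = target[context_length - 1:]
    return X, y


# Synthetic stand-in data: one year of daily values for four hypothetical
# input features (e.g. a radiation proxy, an optical index, LST, backscatter).
rng = np.random.default_rng(0)
n_days, n_features = 365, 4
features = rng.normal(size=(n_days, n_features))
target = rng.normal(size=n_days)

# A shorter and a longer context window built from the same series.
X_short, y_short = make_context_windows(features, target, context_length=30)
X_long, y_long = make_context_windows(features, target, context_length=120)
print(X_short.shape, X_long.shape)
```

A longer window gives each sample more history but yields fewer samples from the same series and increases the cost of each forward pass, which is one way to picture the accuracy-efficiency trade-off described above.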
