CarbonSense: A Multimodal Dataset and Baseline for Carbon Flux Modelling
Matthew Fortier, Mats L. Richter, Oliver Sonnentag, Chris Pal
TL;DR
CarbonSense addresses the lack of standardized data for data-driven carbon flux modelling by providing the first ML-ready multimodal dataset, combining eddy covariance flux measurements with MODIS geospatial data across 385 sites and over 27 million hourly observations. The authors introduce EcoPerceiver, a transformer-based multimodal architecture that uses windowed cross-attention to integrate meteorological, geospatial, and semantic inputs, and compare it to a strong XGBoost baseline. Results show that EcoPerceiver achieves higher NSE and lower RMSE across most ecosystem types and generalizes better to out-of-distribution sites, suggesting that multimodal deep learning can significantly improve carbon flux predictions. The dataset, baselines, and experimental guidelines promote reproducibility and accelerate progress in global carbon flux modelling, with potential impacts on climate decision-making.
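For readers unfamiliar with the two reported metrics, the following is a minimal sketch of how they are commonly computed, assuming NSE denotes the Nash-Sutcliffe efficiency and RMSE the root mean squared error. The function names and sample values are illustrative assumptions and are not taken from the CarbonSense codebase.

```python
import math


def rmse(observed, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared_errors = [(o - p) ** 2 for o, p in zip(observed, predicted)]
    return math.sqrt(sum(squared_errors) / len(observed))


def nse(observed, predicted):
    """Nash-Sutcliffe efficiency: 1 - SS_res / SS_tot.

    A value of 1 is a perfect fit; 0 means the model is no better than
    always predicting the mean of the observations.
    """
    if len(observed) != len(predicted) or len(observed) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    if ss_tot == 0:
        raise ValueError("NSE is undefined when observations are constant")
    return 1.0 - ss_res / ss_tot


# Hypothetical flux values, for illustration only.
observed_flux = [1.0, 2.0, 3.0, 4.0]
predicted_flux = [1.0, 2.0, 3.0, 4.0]
print(nse(observed_flux, predicted_flux), rmse(observed_flux, predicted_flux))
```

Higher NSE and lower RMSE both indicate a closer match to the measured fluxes. Because NSE is normalized by the variance of the observations, it is more comparable across ecosystem types whose fluxes differ in magnitude.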
Abstract
Terrestrial carbon fluxes provide vital information about our biosphere's health and its capacity to absorb anthropogenic CO$_2$ emissions. The importance of predicting carbon fluxes has led to the emerging field of data-driven carbon flux modelling (DDCFM), which uses statistical techniques to predict carbon fluxes from biophysical data. However, the field lacks a standardized dataset to promote comparisons between models. To address this gap, we present CarbonSense, the first machine learning-ready dataset for DDCFM. CarbonSense integrates measured carbon fluxes, meteorological predictors, and satellite imagery from 385 locations across the globe, offering comprehensive coverage and facilitating robust model training. Additionally, we provide a baseline model using a current state-of-the-art DDCFM approach, as well as a novel transformer-based model. Our experiments illustrate the potential gains that multimodal deep learning techniques can bring to this domain. By providing these resources, we aim to lower the barrier to entry for other deep learning researchers to develop new models and drive new advances in carbon flux modelling.
