ClimDetect: A Benchmark Dataset for Climate Change Detection and Attribution
Sungduk Yu, Brian L. White, Anahita Bhiwandiwalla, Musashi Hinck, Matthew Lyle Olson, Yaniv Gurwicz, Raanan Y. Rohekar, Tung Nguyen, Vasudev Lal
TL;DR
ClimDetect provides a large, ML-ready benchmark for climate change detection and attribution by pairing daily CMIP6 inputs with targets such as $AGMT$ (annual global mean temperature) using variables $tas$, $huss$, and $pr$ on a $64 \times 128$ grid ($X \in \mathbb{R}^{64 \times 128 \times 3}$, $y \in \mathbb{R}$). The framework trains a regression model $y = F_{\theta}(X)$ on CMIP6 data, then conducts hypothesis tests against the natural variability distribution $P(y_{hist})$, with Year of Emergence ($YoE$) and emergence fraction thresholds guiding signal detection. The authors benchmark four Vision Transformers and traditional baselines (ridge, MLP, CNN) on ClimDetect and real-world reanalysis (ERA5, JRA-3Q, MERRA-2), showing ViTs often yield lower RMSE and earlier YoE, thereby improving detection sensitivity. The dataset and accompanying benchmarks are openly accessible via Hugging Face to promote reproducibility, comparability, and accelerated ML-driven climate change research.
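The detection step described above can be illustrated with a minimal sketch. It assumes that daily predictions $y = F_{\theta}(X)$ are already available from some fitted model, and it takes one plausible reading of the test: a year counts as "emerged" when at least a fraction `frac` of its daily predictions fall outside the central percentile range of the historical predictions, and $YoE$ is the first year from which every later year is emerged. The function names, the percentile bounds, the `hist_end` cutoff, and the `frac` threshold are illustrative assumptions, not the authors' definitions.

```python
from collections import defaultdict


def percentile(values, q):
    """Linear-interpolation percentile (q in [0, 100]) of a non-empty sequence."""
    s = sorted(values)
    pos = (len(s) - 1) * q / 100.0
    lo = int(pos)
    hi = min(lo + 1, len(s) - 1)
    return s[lo] + (s[hi] - s[lo]) * (pos - lo)


def year_of_emergence(years, y_pred, hist_end, q_low=2.5, q_high=97.5, frac=0.5):
    """Return the first year from which predictions stay outside natural variability.

    years    : year label of each daily sample
    y_pred   : model prediction for each daily sample
    hist_end : last year treated as the natural-variability reference period
    frac     : assumed emergence-fraction threshold (hypothetical default)
    Returns None if the signal never emerges permanently.
    """
    # Empirical stand-in for P(y_hist): predictions from the reference period.
    hist = [p for yr, p in zip(years, y_pred) if yr <= hist_end]
    if not hist:
        raise ValueError("no samples in the historical reference period")
    low, high = percentile(hist, q_low), percentile(hist, q_high)

    # For each year, record which days fall outside the historical range.
    outside = defaultdict(list)
    for yr, p in zip(years, y_pred):
        outside[yr].append(p < low or p > high)

    ordered = sorted(outside)
    emerged = [sum(outside[yr]) / len(outside[yr]) >= frac for yr in ordered]

    # Walk backwards: YoE is the start of the final unbroken run of emerged years.
    yoe = None
    for yr, flag in zip(reversed(ordered), reversed(emerged)):
        if not flag:
            break
        yoe = yr
    return yoe
```

Under these assumptions, a model with lower prediction noise produces a narrower historical range, which is one way a lower RMSE can translate into an earlier $YoE$.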
Abstract
Detecting and attributing temperature increases driven by climate change is crucial for understanding global warming and informing adaptation strategies. However, distinguishing human-induced climate signals from natural variability remains challenging for traditional detection and attribution (D&A) methods, which rely on identifying specific "fingerprints" -- spatial patterns expected to emerge from external forcings such as greenhouse gas emissions. Deep learning offers promise in discerning these complex patterns within expansive spatial datasets, yet the lack of standardized protocols has hindered consistent comparisons across studies. To address this gap, we introduce ClimDetect, a standardized dataset comprising 1.17M daily climate snapshots paired with target climate change indicator variables. The dataset is curated from both CMIP6 climate model simulations and real-world observation-assimilated reanalysis datasets (ERA5, JRA-3Q, and MERRA-2), and is designed to enhance model accuracy in detecting climate change signals. ClimDetect integrates various input and target variables used in previous research, ensuring comparability and consistency across studies. We also explore the application of vision transformers (ViT) to climate data -- a novel approach that, to our knowledge, has not been attempted before for climate change detection tasks. Our open-access data serve as a benchmark for advancing climate science by enabling end-to-end model development and evaluation. ClimDetect is publicly accessible via the Hugging Face dataset repository at https://huggingface.co/datasets/ClimDetect/ClimDetect.
