The impact of internal variability on benchmarking deep learning climate emulators
Björn Lütjens, Raffaele Ferrari, Duncan Watson-Parris, Noelle Selin
TL;DR
The paper investigates how internal climate variability affects the benchmarking of data-driven climate emulators. The authors compare linear pattern scaling (LPS) with deep learning (DL) emulators on the ClimateBench benchmark, then replace the benchmark targets with ensemble averages from a 50-realization MPI-ESM1.2-LR ensemble (Em-MPI). On the original benchmark, LPS outperforms the incumbent DL emulator on 3 out of 4 regionally resolved variables, including surface temperature and precipitation. The study shows that, when only a few realizations are available, DL models can overfit to low-frequency internal variability, which biases benchmark conclusions. With the larger ensemble, a DL method can surpass LPS for precipitation, while LPS remains more accurate for temperature. The work emphasizes evaluating emulators against robust baselines and using large ensembles to separate forced signals from internal noise, and it provides public code and data to support reproducibility and future benchmarking improvements.
Abstract
Full-complexity Earth system models (ESMs) are computationally expensive, limiting their use in exploring the climate outcomes of multiple emission pathways. More efficient emulators that approximate ESMs can directly map emissions onto climate outcomes, and benchmarks are being used to evaluate their accuracy on standardized tasks and datasets. We investigate a popular benchmark in data-driven climate emulation, ClimateBench, on which deep learning-based emulators are currently achieving the best performance. We compare these deep learning emulators with a linear regression-based emulator, akin to pattern scaling, and show that it outperforms the incumbent 100M-parameter deep learning foundation model, ClimaX, on 3 out of 4 regionally resolved climate variables, notably surface temperature and precipitation. While emulating surface temperature is expected to be predominantly linear, this result is surprising for emulating precipitation. Precipitation is a much noisier variable, and we show that deep learning emulators can overfit to internal variability noise at low frequencies, degrading their performance in comparison to a linear emulator. We address the issue of overfitting by increasing the number of climate simulations per emission pathway (from 3 to 50) and updating the benchmark targets with the respective ensemble averages from the MPI-ESM1.2-LR model. Using the new targets, we show that linear pattern scaling continues to be more accurate on temperature, but can be outperformed by a deep learning-based technique for emulating precipitation. We publish our code and data at github.com/blutjens/climate-emulator.
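The two ideas in the abstract, a linear pattern-scaling emulator and ensemble-averaged targets, can be illustrated with a minimal sketch on synthetic data. This is not the code from the authors' repository. The function names, grid size, noise level, and the choice of global-mean temperature as the single predictor are assumptions made for illustration. Internal variability is modeled here as white noise, which is a simplification: the abstract stresses low-frequency variability, which averages out more slowly.

```python
import numpy as np

def fit_pattern_scaling(global_t, local_field):
    """Fit one least-squares line per grid cell: local = slope * global_t + intercept.

    global_t:    shape (n_years,), hypothetical global-mean temperature anomaly
    local_field: shape (n_years, n_cells), target variable at each grid cell
    """
    design = np.column_stack([global_t, np.ones_like(global_t)])
    coef, *_ = np.linalg.lstsq(design, local_field, rcond=None)
    return coef[0], coef[1]  # slopes and intercepts, each of shape (n_cells,)

def predict_pattern_scaling(global_t, slopes, intercepts):
    """Scale the fitted spatial pattern by the global-mean predictor."""
    return np.outer(global_t, slopes) + intercepts

def rmse(a, b):
    return float(np.sqrt(np.mean((a - b) ** 2)))

# Synthetic "truth": a forced signal that is exactly linear in global_t.
rng = np.random.default_rng(0)
n_years, n_cells = 80, 200
global_t = np.linspace(0.0, 4.0, n_years)
true_pattern = rng.normal(1.0, 0.5, n_cells)
forced = np.outer(global_t, true_pattern)

def ensemble_mean_target(n_members, noise_std=1.0):
    """Average n_members noisy realizations (white noise is an assumption)."""
    noise = rng.normal(0.0, noise_std, (n_members, n_years, n_cells))
    return (forced + noise).mean(axis=0)

target_3 = ensemble_mean_target(3)    # few realizations per pathway
target_50 = ensemble_mean_target(50)  # many realizations per pathway

# Residual internal variability left in the benchmark target itself.
rmse_3 = rmse(target_3, forced)
rmse_50 = rmse(target_50, forced)

# Pattern scaling fitted on the 50-member mean.
slopes, intercepts = fit_pattern_scaling(global_t, target_50)
prediction = predict_pattern_scaling(global_t, slopes, intercepts)
```

In this toy setting the 3-member target carries noticeably more residual noise than the 50-member target, so a flexible model scored against the former can be rewarded or penalized for fitting noise rather than the forced response. The sketch says nothing about the relative skill of deep learning and linear emulators on real data.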
