The impact of internal variability on benchmarking deep learning climate emulators

Björn Lütjens; Raffaele Ferrari; Duncan Watson-Parris; Noelle Selin

TL;DR

This paper investigates how internal climate variability affects the benchmarking of data-driven climate emulators. The authors compare linear pattern scaling (LPS) with deep learning (DL) emulators on the ClimateBench benchmark and show that LPS outperforms the incumbent DL emulator on 3 out of 4 regionally resolved variables, notably surface temperature and precipitation. They show that DL emulators can overfit to low-frequency internal variability when only a few realizations are available, which biases benchmark conclusions. With targets rebuilt from ensemble averages of the 50-realization Em-MPI data, a DL emulator can surpass LPS for precipitation, while LPS remains more accurate for temperature. The work recommends evaluating emulators against robust baselines and using large ensembles to separate forced signals from internal noise, and it provides public code and data to support reproducibility and future benchmarking improvements.

Abstract

Full-complexity Earth system models (ESMs) are computationally expensive, which limits their use in exploring the climate outcomes of multiple emission pathways. More efficient emulators that approximate ESMs can directly map emissions onto climate outcomes, and benchmarks are being used to evaluate their accuracy on standardized tasks and datasets. We investigate a popular benchmark in data-driven climate emulation, ClimateBench, on which deep learning-based emulators are currently achieving the best performance. We compare these deep learning emulators with a linear regression-based emulator, akin to pattern scaling, and show that it outperforms the incumbent 100M-parameter deep learning foundation model, ClimaX, on 3 out of 4 regionally resolved climate variables, notably surface temperature and precipitation. While emulating surface temperature is expected to be predominantly linear, this result is surprising for emulating precipitation. Precipitation is a much noisier variable, and we show that deep learning emulators can overfit to internal variability noise at low frequencies, degrading their performance in comparison to a linear emulator. We address the issue of overfitting by increasing the number of climate simulations per emission pathway (from 3 to 50) and updating the benchmark targets with the respective ensemble averages from the MPI-ESM1.2-LR model. Using the new targets, we show that linear pattern scaling continues to be more accurate on temperature, but can be outperformed by a deep learning-based technique for emulating precipitation. We publish our code and data at github.com/blutjens/climate-emulator.
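
To make the linear baseline concrete, the following is a minimal sketch of pattern scaling under stated assumptions: it regresses each grid cell's anomaly on the global-mean surface temperature anomaly with ordinary least squares. The function names, array shapes, and synthetic data are hypothetical and are not taken from the authors' repository; the paper's exact LPS configuration may differ.

```python
import numpy as np


def fit_pattern_scaling(global_tas, local_field):
    """Fit local_field[t, cell] ~ slope[cell] * global_tas[t] + intercept[cell].

    global_tas: shape (time,), global-mean temperature anomaly (assumed predictor).
    local_field: shape (time, cells), local anomaly of any variable.
    """
    t = np.asarray(global_tas, dtype=float)
    y = np.asarray(local_field, dtype=float)
    t_anom = t - t.mean()
    y_anom = y - y.mean(axis=0)
    # Closed-form least-squares slope, computed independently per grid cell.
    slope = (t_anom[:, None] * y_anom).sum(axis=0) / (t_anom ** 2).sum()
    intercept = y.mean(axis=0) - slope * t.mean()
    return slope, intercept


def predict_pattern_scaling(global_tas, slope, intercept):
    """Scale the fitted spatial pattern by a global-mean temperature series."""
    t = np.asarray(global_tas, dtype=float)
    return t[:, None] * slope + intercept


# Synthetic illustration (not real climate data): three grid cells that warm
# at different rates relative to the global mean, plus white noise.
rng = np.random.default_rng(0)
global_tas = np.linspace(0.0, 3.0, 50)
true_slope = np.array([0.5, 1.0, 2.0])
local = global_tas[:, None] * true_slope + 0.1 * rng.standard_normal((50, 3))

slope, intercept = fit_pattern_scaling(global_tas, local)
pred = predict_pattern_scaling(global_tas, slope, intercept)
```

Because the fit has only two parameters per grid cell, it has little capacity to follow low-frequency noise in the training targets, which is the property the abstract contrasts with deep learning emulators.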

Paper Structure (39 sections, 17 equations, 18 figures, 2 tables)

Introduction
Data & Methods
Background on the ClimateBench benchmark
Em-MPI data: Addressing internal variability with the MPI-ESM1.2-LR ensemble
Linear Pattern Scaling emulator
Review of the CNN-LSTM emulator
Testing the influence of internal variability on emulator performance
A heuristic model to illustrate internal variability effects
Results
Evaluation of Linear Pattern Scaling on ClimateBench
Regional accuracy of LPS
Quantitative comparison of LPS with deep learning emulators
Analysing the relationship between internal variability and performance assessments of climate emulators
Magnitude of internal variability in 3-member NorESM2-LM and 50-member MPI-ESM1.2-LR ensemble average
Effect of internal variability on benchmark scores
...and 24 more sections

Figures (18)

Figure 1: A cartoon illustrating the impact of internal variability on benchmarking deep learning emulators. The figure is generated with a mock-up stochastic model described in \ref{sec:overfitting_exp}. The model is a nonlinear function that relates fictitious greenhouse gas emissions to nonlinear changes in an imagined climate variable. Each realization of this function is noisy, representing fluctuations from internal variability in the climate system that remains fundamentally unpredictable over long timescales. When training an emulator to approximate this model when it has only been run a few times for the given emission scenario, neural networks can overfit to low-frequency components of these fluctuations (top row). A simple linear approach represents the emission-forced signal more accurately, despite the opposite being the case when many realizations are available (bottom row). To compare emulation techniques more reliably, we recommend using large climate model ensembles that attenuate the influence of internal variability.
Figure 2: Internal variability in 3-member NorESM2-LM vs. 50-member MPI-ESM1.2-LR ensemble-mean. The plots show the ensemble-mean of the global (black) and regionally-averaged surface temperature (top-red) and precipitation (bottom-blue) anomalies from the $historical$ and $ssp245$ scenario. The left plots show the NorESM data that is used in the ClimateBench test set and the middle plots show the Em-MPI data. The regional averages are calculated over displayed IPCC AR6 reference regions (right). The interannual fluctuations from internal variability are visibly smaller in the 50-member Em-MPI data in comparison to the 3-member NorESM data, especially at the regional scale. More regions are shown in \ref{fig:internal_variability_selected_regions}.
Figure 3: Functional relationships in cumulative $\mathrm{CO}_2$ emissions, surface temperature, and precipitation. The left plot shows the almost linear relationship between cumulative $\mathrm{CO}_2$ emissions (x-axis) and global ensemble-mean surface temperature anomalies (y-axis) for each scenario in the Em-MPI data. The middle and right plots show the ensemble-mean local surface temperature and precipitation anomalies, respectively, averaged over each year (dot) and the IPCC AR6 region S.E.Asia, against the annually-averaged global surface temperature anomaly. We selected S.E.Asia to highlight the contrast between linear and more complex relationships in temperature vs. precipitation and show other regions in \ref{fig:linearity_for_many_regions}.
Figure 4: Linear pattern scaling error map. The left plot shows the target surface temperature anomalies ($\mathrm{tas}$) from the ssp245 ClimateBench test set, which are averages over 3 realizations and 21 years (2080-2100). The middle plot shows the linear pattern scaling predictions and the right plot the error of predictions minus the target. The other variables are plotted in \ref{fig:linear_pattern_scaling_error_map_diurnal_temperature_range}, \ref{fig:linear_pattern_scaling_error_map_precipitation}, and \ref{fig:linear_pattern_scaling_error_map_90th_precipitation}.
Figure 5: Error over realizations in training set for precipitation in mm/day. The top figure in our internal variability experiment shows the spatial RMSE, i.e., $\mathrm{RMSE}_{s,n}$, of an LPS (orange) and CNN-LSTM emulator (blue) that were trained on the ensemble-mean of data subsets with $n$ realizations and evaluated on the ensemble-mean of $N=50$ realizations from the Em-MPI data. Shading indicates the standard deviation across $K=20$ random draws of realization subsets. The bottom figure shows the difference in spatial RMSE, i.e., $\Delta \mathrm{RMSE}_{s,n}$, between the two emulators for each random realization subset, $(n,k)$, (green dots); and the mean and standard deviation across $K$ subsets (black line and shading). Data is plotted on a log x-axis, because most climate models are run for $1$ to $10$ realizations per scenario. As a side note, the $\pm \sigma$ range in the bottom figure is not the average of the two $\sigma$ ranges in the top figure, because an emulator's RMSE covaries with the subset, i.e., if one emulator has low RMSE the other emulator is also likely to have low RMSE.
...and 13 more figures
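
The subset-sampling procedure in the Figure 5 caption can be illustrated with a small sketch. It trains no emulator; under stated assumptions it only shows how much internal-variability noise remains in an $n$-member ensemble mean, using $K$ random draws of members from a synthetic 50-member ensemble. The function names, the white-noise model, and the unweighted RMSE are hypothetical simplifications (a real evaluation on a latitude-longitude grid would likely need area weighting), and nothing here reproduces the paper's results.

```python
import numpy as np


def spatial_rmse(pred, target):
    """Unweighted root-mean-square error over all grid cells."""
    return float(np.sqrt(np.mean((pred - target) ** 2)))


def subset_mean_rmse(ensemble, reference, n, k_draws, rng):
    """RMSE of n-member subset means against a reference field.

    ensemble: shape (members, cells); reference: shape (cells,).
    Returns one score per random draw of n members without replacement.
    """
    n_total = ensemble.shape[0]
    scores = np.empty(k_draws)
    for k in range(k_draws):
        members = rng.choice(n_total, size=n, replace=False)
        scores[k] = spatial_rmse(ensemble[members].mean(axis=0), reference)
    return scores


# Synthetic ensemble: a fixed "forced" pattern plus independent noise per member.
rng = np.random.default_rng(0)
forced = np.linspace(-1.0, 1.0, 200)
ensemble = forced + 0.5 * rng.standard_normal((50, 200))

for n in (1, 3, 10, 50):
    scores = subset_mean_rmse(ensemble, forced, n, 20, rng)
    print(f"n={n:2d}  mean RMSE={scores.mean():.3f}  std={scores.std():.3f}")
```

For independent noise, the residual error of the ensemble mean shrinks roughly as $1/\sqrt{n}$, which is why a 3-member mean leaves much more noise in the targets than a 50-member mean.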
