Navigating the Noise: Bringing Clarity to ML Parameterization Design with O(100) Ensembles

Jerry Lin; Sungduk Yu; Liran Peng; Tom Beucler; Eliot Wong-Toi; Zeyuan Hu; Pierre Gentine; Margarita Geleta; Mike Pritchard

TL;DR

This work tackles the challenge of achieving reliable online performance from ML subgrid parameterizations embedded in a climate model. It builds ClimScale, an end-to-end pipeline that enables online testing of $\mathcal{O}(100)$-member ensembles across nine NN configurations. The study finds that reductions in offline error do not guarantee improvements in online error or stability: switching to an MAE loss improves online stability but increases online error, whereas removing dropout lowers online error at the cost of stability. Memory, batch normalization, and multi-climate training generally improve both online error and stability. A key empirical takeaway is that detecting causally relevant online differences requires large ensembles, on the order of hundreds of members, which underscores the noise inherent in online hybrid simulations. These results demonstrate a scalable, reproducible approach to navigating the noise in hybrid ML-physics models and offer concrete guidance for designing robust online parameterizations in climate models and other geoscience systems.

Abstract

Machine-learning (ML) parameterizations of subgrid processes (here of turbulence, convection, and radiation) may one day replace conventional parameterizations by emulating high-resolution physics without the cost of explicit simulation. However, uncertainty about the relationship between offline performance and online performance, i.e., performance when integrated with a large-scale general circulation model (GCM), hinders their development. Much of this uncertainty stems from limited sampling of the noisy, emergent effects of upstream ML design decisions on downstream online hybrid simulation. Our work rectifies the sampling issue via the construction of a semi-automated, end-to-end pipeline for $\mathcal{O}(100)$ size ensembles of hybrid simulations, revealing important nuances in how systematic reductions in offline error manifest in changes to online error and online stability. For example, removing dropout and switching from a Mean Squared Error (MSE) to a Mean Absolute Error (MAE) loss both reduce offline error, but they have opposite effects on online error and online stability. Other design decisions, like incorporating memory, converting moisture input from specific humidity to relative humidity, using batch normalization, and training on multiple climates, do not come with any such compromises. Finally, we show that ensemble sizes of $\mathcal{O}(100)$ may be necessary to reliably detect causally relevant differences online. By enabling rapid online experimentation at scale, we can empirically settle debates regarding subgrid ML parameterization design that would have otherwise remained unresolved in the noise.
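To make the ensemble-size argument concrete, the following is a minimal sketch, not the paper's analysis, of how the chance of detecting a real difference between two configurations can depend on the number of ensemble members. Everything in it is an assumption for illustration: the online RMSE values are synthetic draws from lognormal distributions, the size of the shift (`log_shift`) is arbitrary, and a permutation test on ensemble medians stands in for whatever statistical comparison the authors actually use. The function names `permutation_pvalue` and `detection_rate` are hypothetical.

```python
import numpy as np


def permutation_pvalue(a, b, n_perm=500, rng=None):
    """Two-sided permutation test on the difference in ensemble medians."""
    rng = np.random.default_rng(0) if rng is None else rng
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(np.median(a) - np.median(b))
    pooled = np.concatenate([a, b])
    # Each row of `order` is an independent random permutation of member indices.
    order = rng.random((n_perm, pooled.size)).argsort(axis=1)
    shuffled = pooled[order]
    diffs = np.abs(
        np.median(shuffled[:, : a.size], axis=1)
        - np.median(shuffled[:, a.size:], axis=1)
    )
    # The +1 terms keep the estimated p-value strictly positive.
    return (np.sum(diffs >= observed) + 1) / (n_perm + 1)


def detection_rate(n_members, log_shift=0.2, n_trials=100, alpha=0.05, seed=0):
    """Fraction of synthetic experiments in which a true shift is detected."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_trials):
        # Synthetic "online RMSE" values; NOT data from the paper.
        baseline = rng.lognormal(mean=0.0, sigma=0.5, size=n_members)
        variant = rng.lognormal(mean=log_shift, sigma=0.5, size=n_members)
        hits += permutation_pvalue(baseline, variant, rng=rng) < alpha
    return hits / n_trials


if __name__ == "__main__":
    for n in (5, 20, 100):
        print(f"members per configuration: {n:3d}  "
              f"detection rate: {detection_rate(n):.2f}")
```

Under these assumptions, small ensembles rarely flag a shift that is genuinely present, while ensembles of around a hundred members detect it far more often. The exact rates depend entirely on the synthetic noise level and shift chosen here.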

Paper Structure (21 sections, 4 equations, 40 figures, 5 tables)

Introduction
Methods
Reference Climate Simulation
Training, Validation, and Offline Test Data
End-to-End Pipeline and Analysis
NN Configurations
Standard Configuration [Baseline]
Specific Humidity Configuration [Backward ablation]
No Memory Configuration [Backward ablation]
No Wind Configuration [Backward ablation]
No Ozone Configuration [Backward ablation]
No Zenith Angle Configuration [Backward ablation]
Mean Absolute Error (MAE) Configuration [Forward ablation]
No Dropout Configuration [Forward ablation]
Multiclimate Configuration [Forward ablation]
...and 6 more sections

Figures (40)

Figure 1: This is a diagram showcasing ClimScale, our end-to-end pipeline for going from preprocessing to online results. Associated code can be found in our GitHub repository: https://github.com/SciPritchardLab/ClimScale.
Figure 2: Offline test RMSE for subgrid heating (1a,1c,1e) and moistening (1b,1d,1f) tendencies across configurations are plotted against validation error rank. 1a and 1b show RMSE for average, multiple linear regression (MLR), and martingale baselines. The MLR baseline makes use of inputs and outputs from the standard configuration. 1e and 1f show configurations with statistically distinct average RMSE, and 1c and 1d show the others. Figure S2 in the SI presents an analogous figure using a multiclimate test set.
Figure 3: Histograms of online temperature RMSE (in K) for each configuration, compared against that of the standard configuration. Only models that integrated for the entire simulation duration (i.e., did not crash while integrating a full simulation year) are shown, and the number of surviving models per configuration is given in the legend. Vertical lines mark the ensemble-median online temperature RMSE. Bin width $= \frac{10}{49}$ K $\approx 0.204$ K.
Figure 4: Histograms of online moisture RMSE (in g/kg) for each configuration, compared against that of the standard configuration. Only models that integrated for the entire simulation duration (i.e., did not crash while integrating a full simulation year) are shown, and the number of surviving models per configuration is given in the legend. Vertical lines mark the ensemble-median online moisture RMSE. Bin width $= \frac{2}{49}$ g/kg $\approx 0.0408$ g/kg.
Figure 5: Zonal mean temperature biases for the top five hybrid models across the standard, specific humidity, no dropout, and multiclimate configurations. Models to the left have lower online error.
...and 35 more figures
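
The bin widths quoted for Figures 3 and 4 ($\frac{10}{49}$ K and $\frac{2}{49}$ g/kg) are consistent with 50 evenly spaced bin edges spanning 0 to 10 K and 0 to 2 g/kg. The captions do not state the edges, so that reading is an assumption. The sketch below shows how the per-configuration quantities in those captions (surviving-model count, ensemble median, histogram counts) could be computed under that assumption. The helper name `summarize_ensemble` and the convention of marking crashed runs with NaN are hypothetical and are not taken from ClimScale.

```python
import numpy as np


def summarize_ensemble(rmse, lo, hi, n_edges=50):
    """Summarize one configuration's online RMSE values.

    `rmse` holds one value per ensemble member; crashed runs are assumed
    to be stored as NaN (a convention chosen for this sketch only).
    """
    rmse = np.asarray(rmse, dtype=float)
    survivors = rmse[np.isfinite(rmse)]  # keep only runs that finished
    # 50 evenly spaced edges give 49 bins of width (hi - lo) / 49.
    edges = np.linspace(lo, hi, n_edges)
    # Values outside [lo, hi] are silently dropped by np.histogram.
    counts, _ = np.histogram(survivors, bins=edges)
    median = float(np.median(survivors)) if survivors.size else float("nan")
    return {
        "n_survivors": int(survivors.size),
        "median": median,
        "bin_width": float(edges[1] - edges[0]),
        "counts": counts,
    }


if __name__ == "__main__":
    # Made-up temperature RMSE values in K; NaN marks a crashed run.
    example = [1.0, 2.0, float("nan"), 3.0, float("nan")]
    summary = summarize_ensemble(example, lo=0.0, hi=10.0)
    print(summary["n_survivors"], summary["median"], summary["bin_width"])
```

Calling the same helper with `lo=0.0, hi=2.0` reproduces the $\frac{2}{49}$ g/kg bin width quoted for the moisture histograms.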
