Navigating the Noise: Bringing Clarity to ML Parameterization Design with O(100) Ensembles
Jerry Lin, Sungduk Yu, Liran Peng, Tom Beucler, Eliot Wong-Toi, Zeyuan Hu, Pierre Gentine, Margarita Geleta, Mike Pritchard
TL;DR
This work tackles the challenge of achieving reliable online performance from ML subgrid parameterizations embedded in a climate model by building ClimScale, an end-to-end pipeline that enables online testing of ensembles of $\mathcal{O}(100)$ hybrid simulations across nine NN configurations. The study finds that reductions in offline error do not guarantee improvements in online error or stability: switching to an MAE loss improves online stability but increases online error, whereas removing dropout reduces online error at the cost of stability. Memory, batch normalization, and multi-climate training generally improve both online error and stability. A key empirical takeaway is that detecting causally relevant online differences requires large ensembles, on the order of hundreds, which underscores the noise inherent in online hybrid simulations. These results demonstrate a scalable, reproducible approach to navigating the noise in hybrid ML-physics models and offer concrete guidance for designing robust online parameterizations in climate models and other geoscience systems.
Abstract
Machine-learning (ML) parameterizations of subgrid processes (here of turbulence, convection, and radiation) may one day replace conventional parameterizations by emulating high-resolution physics without the cost of explicit simulation. However, uncertainty about the relationship between offline and online performance (i.e., when integrated with a large-scale general circulation model (GCM)) hinders their development. Much of this uncertainty stems from limited sampling of the noisy, emergent effects of upstream ML design decisions on downstream online hybrid simulation. Our work rectifies the sampling issue via the construction of a semi-automated, end-to-end pipeline for $\mathcal{O}(100)$ size ensembles of hybrid simulations, revealing important nuances in how systematic reductions in offline error manifest in changes to online error and online stability. For example, removing dropout and switching from a Mean Squared Error (MSE) to a Mean Absolute Error (MAE) loss both reduce offline error, but they have opposite effects on online error and online stability. Other design decisions, like incorporating memory, converting moisture input from specific humidity to relative humidity, using batch normalization, and training on multiple climates, do not come with any such compromises. Finally, we show that ensemble sizes of $\mathcal{O}(100)$ may be necessary to reliably detect causally relevant differences online. By enabling rapid online experimentation at scale, we can empirically settle debates regarding subgrid ML parameterization design that would have otherwise remained unresolved in the noise.
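The claim that ensembles of $\mathcal{O}(100)$ members may be needed to detect online differences can be illustrated with a small synthetic experiment. The sketch below is not the paper's pipeline or data: it assumes that the online error of each ensemble member is an independent Gaussian draw, that one configuration lowers the mean error by a hypothetical 0.4 standard deviations, and that a difference counts as "detected" when the absolute Welch t-statistic exceeds 2 (a rough stand-in for a 5% significance test). The names `welch_t` and `detection_rate`, the effect size, and the noise level are all illustrative assumptions.

```python
import random
import statistics


def welch_t(a, b):
    """Welch's t-statistic for two samples with possibly unequal variances."""
    var_a = statistics.variance(a) / len(a)
    var_b = statistics.variance(b) / len(b)
    return (statistics.mean(a) - statistics.mean(b)) / (var_a + var_b) ** 0.5


def detection_rate(n_members, effect=0.4, sigma=1.0, n_trials=500, seed=0):
    """Fraction of synthetic trials in which a true difference is detected.

    Each trial draws two ensembles of n_members synthetic "online errors":
    a baseline with mean 0 and a variant with mean -effect (hypothetical
    values, in units of the member-to-member noise sigma).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        baseline = [rng.gauss(0.0, sigma) for _ in range(n_members)]
        variant = [rng.gauss(-effect, sigma) for _ in range(n_members)]
        # |t| > 2 is used as an approximate 5% two-sided threshold.
        if abs(welch_t(baseline, variant)) > 2.0:
            hits += 1
    return hits / n_trials


if __name__ == "__main__":
    for n in (5, 20, 100, 300):
        print(f"members per configuration: {n:4d}  "
              f"detection rate: {detection_rate(n):.2f}")
```

Under these assumptions, a handful of members per configuration detects the difference only occasionally, while ensembles in the hundreds detect it in most trials. The actual ensemble size required depends on the true effect size and noise level of the hybrid simulations, which this sketch does not attempt to estimate.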
