Crowdsourcing the Frontier: Advancing Hybrid Physics-ML Climate Simulation via a $50,000 Kaggle Competition

Jerry Lin, Zeyuan Hu, Tom Beucler, Katherine Frields, Hannah Christensen, Walter Hannah, Helge Heuer, Peter Ukkonnen, Laura A. Mansfield, Tian Zheng, Liran Peng, Ritwik Gupta, Pierre Gentine, Yusef Al-Naher, Mingjiang Duan, Kyo Hattori, Weiliang Ji, Chunhan Li, Kippei Matsuda, Naoki Murakami, Shlomo Ron, Marec Serlin, Hongjian Song, Yuma Tanabe, Daisuke Yamamoto, Jianyao Zhou, Mike Pritchard

TL;DR

This work leverages ClimSim, a Kaggle benchmark derived from MMF-based climate modeling, to systematically test offline-trained ML emulators when online-coupled to a climate model. By evaluating six architectures and five design configurations, it demonstrates reproducible online stability across diverse models in a low-resolution, real-geography setting, while revealing persistent offline/online biases and architecture-dependent responses to input expansions. The study shows state-of-the-art online performance on individual metrics without achieving a universal Pareto improvement over the prior benchmark, and it highlights universal biases that transcend architecture and seed, suggesting targeted bias penalties or stochastic approaches for future progress. Overall, the authors advocate for democratizing online testing and advancing benchmark development to bridge the gap between offline skill and robust online climate projections.

Abstract

Subgrid machine-learning (ML) parameterizations have the potential to introduce a new generation of climate models that incorporate the effects of higher-resolution physics without incurring the prohibitive computational cost associated with more explicit physics-based simulations. However, important issues, ranging from online instability to inconsistent online performance, have limited their operational use for long-term climate projections. To more rapidly drive progress in solving these issues, domain scientists and machine learning researchers opened up the offline aspect of this problem to the broader machine learning and data science community with the release of ClimSim, a NeurIPS Datasets and Benchmarks publication, and an associated Kaggle competition. This paper reports on the downstream results of the Kaggle competition by coupling emulators inspired by the winning teams' architectures to an interactive climate model (including full cloud microphysics, a regime historically prone to online instability) and systematically evaluating their online performance. Our results demonstrate that online stability in the low-resolution, real-geography setting is reproducible across multiple diverse architectures, which we consider a key milestone. All tested architectures exhibit strikingly similar offline and online biases, though their responses to architecture-agnostic design choices (e.g., expanding the list of input variables) can differ significantly. Multiple Kaggle-inspired architectures achieve state-of-the-art (SOTA) results on certain metrics such as zonal mean bias patterns and global RMSE, indicating that crowdsourcing the essence of the offline problem is one path to improving online performance in hybrid physics-AI climate simulation.

Paper Structure

This paper contains 23 sections, 4 equations, 57 figures, 3 tables.

Figures (57)

  • Figure 1: Architecture diagrams for the U-Net from Hu2025-mf and for the Squeezeformer, Pure ResLSTM, Pao Model, ConvNeXt, and Encoder-Decoder LSTM from the 1st-, 2nd-, 3rd-, 4th-, and 5th-place teams in the 2024 LEAP ClimSim Kaggle competition.
  • Figure 2: Offline $R^2$ values for each variable across architectures for the standard configuration (depicted using dashed lines and hatched bar charts) and the expanded variable list configuration. For vertically-resolved variables, the colored lines depict the median $R^2$ while the shading shows the min and max across seeds for each architecture. For scalar variables, the bars show the medians while the vertical lines at the top of each bar show the min-max range.
  • Figure 3: Online monthly RMSE for temperature and moisture for all architectures in each configuration. Shading indicates the inter-seed range (min to max RMSE across three seeds per month after excluding RMSE from hybrid simulations that crash at any point). Dashed lines show RMSE from the seed whose monthly mean absolute deviation from MMF RMSE is closest to the median absolute deviation across seeds. For visual clarity, RMSE for hybrid simulations that crash due to numerical instability is not shown. Subplots i and j only show data up to four years because out-of-memory issues caused many hybrid simulations to terminate in the fifth year. An asterisk (*) indicates that survival is assessed by integrating for four, rather than five, simulation years.
  • Figure 4: Vertical profiles of online global RMSE across temperature, specific humidity, liquid cloud, ice cloud, zonal wind, and meridional wind for architectures whose results surpass those of the U-Net shown in Hu2025-mf. Each subplot contains a legend that uses a number and letter to denote the architecture and configuration corresponding to each vertical profile.
  • Figure 5: Online zonal mean bias for the architectures with the lowest global RMSE. As in Figure 4, the architecture and configuration responsible for each zonal mean bias plot are denoted with a number-letter combination, where the numbers (0-6) correspond to the choice of architecture (i.e., U-Net, Squeezeformer, Pure ResLSTM, Pao Model, ConvNeXt, and Encoder-Decoder LSTM, respectively) and the letters (A-E) correspond to the choice of configuration (i.e., standard, confidence loss, difference loss, multirepresentation, and expanded variable list, respectively).
  • ...and 52 more figures
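
The Figure 2 caption describes summarizing offline $R^2$ across seeds as a median line with a min-max envelope. The sketch below shows one way to compute such a summary with NumPy. It is an illustration only: the function names (`r2_per_level`, `summarize_across_seeds`), the array shapes, and the synthetic data are assumptions, not the authors' evaluation code.

```python
import numpy as np


def r2_per_level(y_true, y_pred):
    """Coefficient of determination, computed separately for each column.

    Both inputs have shape (n_samples, n_levels); each column stands in for
    one vertical level of a variable. Assumes every column of y_true has
    nonzero variance, otherwise the division below is undefined.
    """
    ss_res = np.sum((y_true - y_pred) ** 2, axis=0)
    ss_tot = np.sum((y_true - y_true.mean(axis=0)) ** 2, axis=0)
    return 1.0 - ss_res / ss_tot


def summarize_across_seeds(r2_by_seed):
    """Reduce an (n_seeds, n_levels) array to (median, min, max) over seeds."""
    r2_by_seed = np.asarray(r2_by_seed)
    return (
        np.median(r2_by_seed, axis=0),
        r2_by_seed.min(axis=0),
        r2_by_seed.max(axis=0),
    )


# Synthetic stand-in data: three "seeds" that differ only in noise level.
rng = np.random.default_rng(0)
y_true = rng.normal(size=(1000, 4))
noise_scales = [0.1, 0.3, 0.5]  # hypothetical per-seed error magnitudes
r2_by_seed = np.stack(
    [
        r2_per_level(y_true, y_true + s * rng.normal(size=y_true.shape))
        for s in noise_scales
    ]
)

r2_median, r2_min, r2_max = summarize_across_seeds(r2_by_seed)
```

Here `r2_median` would correspond to the plotted line for a vertically resolved variable and `r2_min`/`r2_max` to the shaded envelope. For a scalar variable the same reduction applies with a single column.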