Benchmarking atmospheric circulation variability in an AI emulator, ACE2, and a hybrid model, NeuralGCM
Ian Baxter, Hamid Pahlavan, Pedram Hassanzadeh, Katharine Rucker, Tiffany Shaw
TL;DR
This work benchmarks AI-based atmospheric emulation against ERA5 and CMIP6 AMIP using four dynamical metrics that span tropical and extratropical variability. ACE2-ERA5 (data-driven) and NeuralGCM (hybrid) reproduce short- to mid-range variability such as tropical convective waves and eddy–mean flow interactions, but struggle with slower modes including the quasi-biennial oscillation and the Southern Annular Mode, likely because training focuses on fast dynamics and stratospheric resolution is limited. The findings suggest that AI models can learn key dynamical interactions yet require improved representation of gravity-wave processes and stratospheric dynamics to capture longer-timescale variability, which is crucial for robust climate projections and out-of-distribution applications. Overall, the study highlights both the promise and the current limitations of AI emulators and hybrid models in representing atmospheric variability across multiple timescales, and it points to future improvements in training objectives, vertical resolution, and dynamical benchmarking frameworks.
Abstract
Physics-based atmosphere-land models with prescribed sea surface temperature have had notable successes but also show biases in their representation of atmospheric variability compared to observations. Recently, AI emulators and hybrid models have emerged with the potential to overcome these biases, but they still require systematic evaluation against metrics grounded in fundamental atmospheric dynamics. Here, we evaluate the representation of four atmospheric variability benchmarking metrics in a fully data-driven AI emulator (ACE2-ERA5) and a hybrid model (NeuralGCM). The hybrid model and the emulator can capture the spectra of large-scale tropical waves and extratropical eddy-mean flow interactions, including critical levels. However, both struggle to capture the timescales associated with the quasi-biennial oscillation (QBO, $\sim 28$ months) and Southern Annular Mode propagation ($\sim 150$ days). These dynamical metrics serve as an initial benchmarking tool to inform AI model development and to understand the models' limitations, which may be essential for out-of-distribution applications (e.g., extrapolating to unseen climates).
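To make the notion of a timescale metric concrete, the sketch below estimates the dominant period of a monthly time series from its power spectrum. It is only an illustration and not the method used in the paper: the function name `dominant_period`, the synthetic wind series, its amplitude, and its noise level are all assumptions, chosen so that the signal has a 28-month period similar to the QBO timescale quoted above.

```python
import numpy as np

def dominant_period(series, dt=1.0):
    """Return the period (in units of dt) of the strongest spectral peak.

    Hypothetical helper for illustration; not taken from the paper.
    """
    x = np.asarray(series, dtype=float)
    x = x - x.mean()                       # remove the mean so it does not leak into low frequencies
    power = np.abs(np.fft.rfft(x)) ** 2    # one-sided power spectrum
    freqs = np.fft.rfftfreq(x.size, d=dt)  # frequencies in cycles per dt
    k = np.argmax(power[1:]) + 1           # skip the zero-frequency bin
    return 1.0 / freqs[k]

# Synthetic "equatorial zonal wind" (assumed values): a 28-month oscillation plus noise.
rng = np.random.default_rng(0)
months = np.arange(560)                    # 20 full cycles of 28 months
u = 20.0 * np.sin(2.0 * np.pi * months / 28.0) + rng.normal(0.0, 2.0, months.size)

period = dominant_period(u, dt=1.0)
print(f"Estimated dominant period: {period:.1f} months")
```

Applying the same estimate to reanalysis and model output would give one simple way to compare oscillation periods, although a periodogram peak is a coarse diagnostic whose resolution depends on the record length.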
