Benchmarking atmospheric circulation variability in an AI emulator, ACE2, and a hybrid model, NeuralGCM

Ian Baxter, Hamid Pahlavan, Pedram Hassanzadeh, Katharine Rucker, Tiffany Shaw

TL;DR

This work benchmarks AI-based atmospheric emulation against ERA5 and CMIP6 AMIP using four dynamical metrics that span tropical and extratropical variability. ACE2-ERA5 (data-driven) and NeuralGCM (hybrid) reproduce short- to mid-range variability such as tropical convective waves and eddy–mean flow interactions, but struggle with slower modes including the quasi-biennial oscillation and the Southern Annular Mode, likely because training focuses on fast dynamics and stratospheric resolution is limited. The findings suggest that AI models can learn key dynamical interactions, yet they require improved representation of gravity-wave processes and stratospheric dynamics to capture longer-timescale variability, which is crucial for robust climate projections and out-of-distribution applications. Overall, the study highlights both the promise and the current limitations of AI emulators and hybrid models in representing atmospheric variability across multiple timescales, and it points to improvements in training objectives, vertical resolution, and dynamical benchmarking frameworks.

Abstract

Physics-based atmosphere-land models with prescribed sea surface temperature have notable successes but also biases in their ability to represent atmospheric variability compared to observations. Recently, AI emulators and hybrid models have emerged with the potential to overcome these biases, but they still require systematic evaluation against metrics grounded in fundamental atmospheric dynamics. Here, we evaluate the representation of four atmospheric variability benchmarking metrics in a fully data-driven AI emulator (ACE2-ERA5) and a hybrid model (NeuralGCM). The hybrid model and emulator can capture the spectra of large-scale tropical waves and extratropical eddy-mean flow interactions, including critical levels. However, both struggle to capture the timescales associated with the quasi-biennial oscillation (QBO, $\sim 28$ months) and Southern Annular Mode propagation ($\sim 150$ days). These dynamical metrics serve as an initial benchmarking tool to inform AI model development and to understand the models' limitations, which may be essential for out-of-distribution applications (e.g., extrapolating to unseen climates).

Paper Structure

This paper contains 16 sections, 1 equation, 10 figures, 1 table.

Figures (10)

  • Figure 1: Monthly zonal-mean zonal wind averaged over 10$^{\circ}$N–10$^{\circ}$S from ERA5 (black curves in all panels), (a) two AMIP models (orange curves, solid: IPSL-CM6-LR, dashed: CESM2), (b) one ACE2-ERA5 lagged ensemble member (red curve), and (c) two NeuralGCM2.8 lagged ensemble members (blue curves). Panels a and c show the zonal wind at 50 hPa, while panel b shows the average over the vertical layer centered near 50 hPa. The testing periods for each product and the validation period used to determine the optimal checkpoint for ACE2-ERA5 are marked in gray shading. See Figure S1 for the periodicities and amplitudes across each ensemble and Figure S2 for the vertical propagation of zonal winds.
  • Figure 2: Wavenumber-frequency power spectrum of the symmetric (left column) and antisymmetric (right column) components of daily mean precipitation from (a-b) ERA5, (c-d) one AMIP model (CESM2-WACCM), (e-f) ACE2-ERA5, and (g-h) NeuralGCM. See Figure S3 for the background spectra.
  • Figure 3: Contours of 250 hPa transient eddy momentum flux versus latitude and phase speed for DJFM (left column) and JJAS (right column) from (a-b) ERA5, (c-d) one AMIP model (CESM2-WACCM), (e-f) ACE2-ERA5 (37-member mean), and (g-h) NeuralGCM2.8 (37-member mean). Shading intervals are 0.50 $\mathrm{m^{2}\ s^{-2} \cdot \Delta c^{-1}}$. The eddy momentum fluxes are normalized by phase speed bin size ($\Delta c$). Purple contours in each panel denote seasonally averaged mean zonal wind. See Figure S4 for eddy momentum flux convergence.
  • Figure 4: Frequency (cycles per day) power spectra for 80$^{\circ}$S to 20$^{\circ}$S zonal-mean zonal wind anomalies projected onto the leading EOF mode ($z_{1}$) from (a) ERA5, (b) AMIP, (c) ACE2-ERA5, and (d) NeuralGCM. Thick lines represent ensemble means and thin lines represent individual models or realizations. The dashed blue line and shaded region denote the red noise curve and its 95% confidence interval. The vertical red line denotes the 150-day periodicity associated with the SAM.
  • Figure S1: (a) Power spectra of 50-hPa zonal-mean zonal wind in the tropics (10$^{\circ}$S–10$^{\circ}$N) from ERA5 (black), AMIP (orange), NeuralGCM2.8 (blue), and ACE2-ERA5 (red). The power spectrum for each ensemble member is computed individually, and the ensemble average is shown in panel (a). (b) Maximum zonal-mean zonal wind amplitude (maximum minus minimum zonal-mean wind) versus each model/realization's dominant period.
  • ...and 5 more figures
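
The quantities named in the Figure S1 caption, a dominant period and a maximum-minus-minimum amplitude, can be illustrated with a minimal sketch. This is not the authors' code: the function names `dominant_period` and `peak_to_peak` are hypothetical, the periodogram is a plain discrete Fourier transform with no windowing or tapering, and the input is a synthetic 28-month sinusoid standing in for a monthly 50 hPa equatorial zonal-wind series.

```python
import cmath
import math


def dominant_period(series, dt=1.0):
    """Return the period (in units of dt) of the largest periodogram peak."""
    n = len(series)
    mean = sum(series) / n
    # Remove the time mean so the zero-frequency bin carries no power.
    anomalies = [x - mean for x in series]
    best_k, best_power = None, -1.0
    # Scan the positive Fourier frequencies k / (n * dt), k = 1 .. n // 2.
    for k in range(1, n // 2 + 1):
        coeff = sum(a * cmath.exp(-2j * math.pi * k * t / n)
                    for t, a in enumerate(anomalies))
        power = abs(coeff) ** 2
        if power > best_power:
            best_k, best_power = k, power
    return n * dt / best_k


def peak_to_peak(series):
    """Maximum minus minimum of the series (the amplitude measure in Figure S1b)."""
    return max(series) - min(series)


# Synthetic stand-in (an assumption, not model output): 28 years of monthly
# data containing a 28-month oscillation with a 20 m/s amplitude.
months = range(336)
u50 = [20.0 * math.sin(2 * math.pi * m / 28.0) for m in months]

period = dominant_period(u50, dt=1.0)  # months
amplitude = peak_to_peak(u50)          # m/s
```

On real reanalysis or model output, the record length rarely holds a whole number of cycles, so the peak falls between Fourier frequencies and the recovered period is only as fine as the spacing $1/(n\,\Delta t)$ allows. Averaging spectra across ensemble members, as the caption describes, reduces the noise in each frequency bin but does not refine that spacing.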