Benchmarking Regional Thermodynamic Trends in an AI emulator, ACE2, and a hybrid model, NeuralGCM

Katharine Rucker, Ian Baxter, Pedram Hassanzadeh, Tiffany A. Shaw, Hamid A. Pahlavan

TL;DR

The study benchmarks two AI-based climate models, the emulator ACE2 and the hybrid model NeuralGCM, against ERA5 and physics-based AMIP ensembles to test their ability to reproduce satellite-era regional thermodynamic trends. ACE2 is fully data-driven, while NeuralGCM blends a dynamical core with neural subgrid parameterizations; both are evaluated on trends across latitude bands, vertical structure, extremes, and drying for the period 1981-2014. The results show that the AI models can match or exceed physics-based models in several regional signals, notably Arctic Amplification and tropical upper-tropospheric warming, with ACE2 often performing best on extratropical vertical temperature structure. However, neither the AI nor the physics-based models consistently capture heat-extreme or drying trends in the US Southwest or other arid regions. This underscores the importance of explicit land representation and land-atmosphere coupling for regional skill, and suggests that AI models could complement traditional models when augmented with land processes.

Abstract

AI models have emerged as potential complements to physics-based models, but their skill in capturing observed regional climate trends with important societal impacts has not been explored. Here, we benchmark satellite-era regional thermodynamic trends, including extremes, in an AI emulator (ACE2) and a hybrid model (NeuralGCM). We also compare the AI models' skill to physics-based land-atmosphere models. Both AI models show skill in capturing regional temperature trends such as Arctic Amplification. ACE2 outperforms other models in capturing vertical temperature trends in the midlatitudes. However, the AI models do not capture regional trends in heat extremes over the US Southwest. Furthermore, they do not capture drying trends in arid regions, even though they generally perform better than physics-based models. Our results also show that a data-driven AI emulator can perform comparably to, or better than, hybrid and physics-based models in capturing regional thermodynamic trends.

Paper Structure

This paper contains 15 sections, 2 equations, 13 figures, and 1 table.

Figures (13)

  • Figure 1: (a) Latitude-weighted zonal-mean 850-hPa temperature trend (1981-2014) in different latitudinal bands for physics-based models, NeuralGCM, ACE2, and ERA5. Open pink circle represents the CAM6 ensemble. Grey vertical lines represent the 5-95% ensemble spread. Black horizontal line is the ERA5 trend. (b) Ensemble-mean annual time series of latitude-weighted 850-hPa temperature averaged over 60N-90N, normalized by the 1981-1990 mean of the respective model. Backslash hatching represents the ACE2 testing period, and forward-slash hatching represents the NeuralGCM testing period. See Figs. S2 and S3 for the same analysis but over land and over ocean, respectively.
  • Figure 2: (a)-(c) Tropical temperature trends: Temporal (1981-2014) and spatial (20S-20N) average of vertical temperature trends from physics-based models, ACE2, and NeuralGCM compared with ERA5. Shading shows the 5-95% ensemble spread. (d)-(f) Same as (a)-(c) but for the Northern Hemisphere extratropics (20N-60N). See Fig. S4 for the Southern Hemisphere extratropics.
  • Figure 3: 2-meter TXx trends for ERA5, physics-based models, and ACE2, and 1000-hPa TXx trends for NeuralGCM (2-meter data are not available). First row: Spatial TXx trends over Western Europe. Second row: (d) Model ensemble spread of the 1981-2014 TXx trend averaged over the Western Europe region outlined in black in the first row. Open circle represents CAM6. Grey vertical line represents the 5-95% spread. (e) Ensemble-mean time series of latitude-weighted average TXx for Western Europe, normalized by the 1981-1990 average. Third row: As in first row, but for Midwest US. Fourth row: As in second row, but for Midwest US. The t-test did not show statistically significant trends for ERA5 and NeuralGCM. Last row: As in second row, but for Southwest US.
  • Figure 4: 700-hPa specific humidity (q) trends by region. First row: Spatial specific humidity trend for ERA5, ACE2, and NeuralGCM for Southwest US. Second row: (d) Model ensemble spread of the 1981-2014 humidity trend averaged over the Southwest US region outlined in black. Open circle represents the CAM6 ensemble. Grey vertical line represents the 5-95% ensemble spread. The t-test did not show statistically significant trends for physics-based models or NeuralGCM. (e) Ensemble-mean annual time series of latitude-weighted average specific humidity for Southwest US, normalized by the 1981-1990 average. Backslash hatching represents the ACE2 testing period, and forward-slash hatching represents the NeuralGCM testing period. Third row: As in first row, but for South America. Last row: As in second row, but for South America. The t-test did not show statistically significant trends for physics-based models or NeuralGCM.
  • Figure S1: Comparison of normalized global temperature at 850-hPa from ACE2 direct output and at 850-hPa interpolated from ACE2's hybrid sigma levels.
  • ...and 8 more figures
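
The captions above repeatedly refer to three operations: a latitude-weighted regional average, a linear trend over 1981-2014, and a 5-95% ensemble spread. The following is a minimal sketch of how such quantities might be computed, not the authors' code. It assumes cos(latitude) weighting, an ordinary least-squares trend, and that "normalized by the 1981-1990 mean" means subtracting that baseline; the captions do not confirm any of these choices. All function names are hypothetical and the data are synthetic.

```python
import numpy as np

def lat_weighted_mean(field, lats):
    """Mean over the last two axes (lat, lon), weighted by cos(latitude)."""
    weights = np.cos(np.deg2rad(lats))
    zonal_mean = field.mean(axis=-1)  # average over longitude first
    return (zonal_mean * weights).sum(axis=-1) / weights.sum()

def trend_per_decade(years, series):
    """Ordinary least-squares slope of an annual series, per decade."""
    return 10.0 * np.polyfit(years, series, 1)[0]

def anomaly_from_baseline(years, series, start=1981, end=1990):
    """Subtract the baseline-period mean (assumed meaning of 'normalized')."""
    mask = (years >= start) & (years <= end)
    return series - series[..., mask].mean(axis=-1, keepdims=True)

def ensemble_spread(member_trends):
    """5th and 95th percentiles across ensemble members."""
    return np.percentile(member_trends, [5, 95])

# Synthetic stand-in for an ensemble of annual-mean 850-hPa temperature
# over 60N-90N: an imposed warming of 0.05 K/yr plus random noise.
rng = np.random.default_rng(0)
years = np.arange(1981, 2015)
lats = np.linspace(60.0, 90.0, 13)
lons = np.arange(0.0, 360.0, 30.0)
n_members = 10
noise = rng.normal(0.0, 0.3, size=(n_members, years.size, lats.size, lons.size))
temps = 260.0 + 0.05 * (years - years[0])[None, :, None, None] + noise

series = lat_weighted_mean(temps, lats)        # shape: (members, years)
anomalies = anomaly_from_baseline(years, series)
trends = np.array([trend_per_decade(years, s) for s in series])
low, high = ensemble_spread(trends)
print(f"ensemble-mean trend: {trends.mean():.3f} K/decade")
print(f"5-95% spread: [{low:.3f}, {high:.3f}] K/decade")
```

Computing the trend per member and then taking percentiles gives the spread across members, which differs from the trend of the ensemble-mean series; the captions show both kinds of quantity.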