Deep Learning Atmospheric Models Reliably Simulate Out-of-Sample Land Heat and Cold Wave Frequencies
Zilu Meng, Gregory J. Hakim, Wenchang Yang, Gabriel A. Vecchi
TL;DR
The paper evaluates two deep-learning atmospheric models (DLESyM and NGCM) against a conventional physical GCM (HiRAM) for simulating land heatwaves and cold waves under AMIP forcing from 1900–2020, with a focus on the out-of-sample period 1900–1960. It shows that both DL-based models can generalize to unseen climate conditions with skill comparable to HiRAM, though their performance varies by region and depends on the degree of temporal autocorrelation in surface temperatures: DLESyM overestimates extreme-event frequencies because its temperatures are too strongly autocorrelated, while NGCM aligns more closely with HiRAM. A simple linear baseline and multiple verification datasets (20CRv3, ERA5, BE, HadISST) help attribute discrepancies to forcing and data limitations. The findings highlight that model architecture, especially how physical constraints shape persistence, matters for extreme-event frequency estimates. They also point to DL-based GCMs as fast, scalable complements to traditional climate models, offering large ensembles for robust uncertainty quantification.
Abstract
Deep learning (DL)-based general circulation models (GCMs) are emerging as fast simulators, yet their ability to replicate extreme events outside their training range remains unknown. Here, we evaluate two such models, the hybrid Neural General Circulation Model (NGCM) and the purely data-driven Deep Learning Earth System Model (DLESyM), against a conventional high-resolution land-atmosphere model (HiRAM) in simulating land heatwaves and cold waves. All models are forced with observed sea surface temperatures and sea ice over 1900–2020, and we focus on the out-of-sample early-20th-century period (1900–1960). Both DL models generalize successfully to unseen climate conditions, broadly reproducing the frequency and spatial patterns of heatwave and cold wave events during 1900–1960 with skill comparable to HiRAM. The exception is portions of North Asia and North America, where all models perform poorly during 1940–1960. Because its temperatures are excessively autocorrelated, DLESyM tends to overestimate heatwave and cold wave frequencies, whereas the physics-DL hybrid NGCM exhibits persistence more similar to HiRAM.
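The link between temperature autocorrelation and heatwave frequency can be illustrated with a toy experiment. The sketch below is not the paper's method: it assumes a unit-variance AR(1) process as a stand-in for daily temperature anomalies, and it assumes a hypothetical heatwave definition (at least 3 consecutive days above the series' own 90th percentile). The function names `simulate_ar1` and `count_heatwave_days`, the persistence values, and the series length are all illustrative choices. Because the threshold is a percentile of each series, both series have the same share of hot days; only the clustering of those days differs.

```python
import numpy as np


def simulate_ar1(phi, n_days, rng):
    """Simulate a unit-variance AR(1) series: x[t] = phi * x[t-1] + noise[t]."""
    # Scale the noise so the marginal variance stays 1 for any phi in (-1, 1).
    noise = rng.standard_normal(n_days) * np.sqrt(1.0 - phi**2)
    x = np.empty(n_days)
    x[0] = rng.standard_normal()
    for t in range(1, n_days):
        x[t] = phi * x[t - 1] + noise[t]
    return x


def count_heatwave_days(x, quantile=0.9, min_length=3):
    """Count days inside runs of >= min_length consecutive days above the quantile.

    The threshold and minimum run length are assumptions for illustration only.
    """
    hot = x > np.quantile(x, quantile)
    # Pad with False on both sides so every hot run has a start and an end.
    padded = np.concatenate(([False], hot, [False])).astype(int)
    change = np.diff(padded)
    starts = np.flatnonzero(change == 1)
    ends = np.flatnonzero(change == -1)
    lengths = ends - starts
    return int(lengths[lengths >= min_length].sum())


rng = np.random.default_rng(0)
n_days = 20_000

# Two hypothetical models that differ only in day-to-day persistence.
days_low = count_heatwave_days(simulate_ar1(0.5, n_days, rng))
days_high = count_heatwave_days(simulate_ar1(0.9, n_days, rng))

print("heatwave days, lower persistence: ", days_low)
print("heatwave days, higher persistence:", days_high)
```

In this toy setting, the more persistent series places more of its hot days in long runs, so it yields more heatwave days even though the marginal distribution is unchanged. The same reasoning applies to cold waves with a low percentile threshold.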
