
Can AI weather models predict out-of-distribution gray swan tropical cyclones?

Y. Qiang Sun, Pedram Hassanzadeh, Mohsen Zand, Ashesh Chattopadhyay, Jonathan Weare, Dorian S. Abbot

TL;DR

An AI weather model is trained after removing Category 3–5 tropical cyclones from its training set and is then tested on Category 5 storms. It shows promise in learning from strong storms in one region and forecasting them in another.

Abstract

Predicting gray swan weather extremes, which are possible but so rare that they are absent from the training dataset, is a major concern for AI weather models and long-term climate emulators. An important open question is whether AI models can extrapolate from weaker weather events present in the training set to stronger, unseen weather extremes. To test this, we train independent versions of the AI model FourCastNet on the 1979-2015 ERA5 dataset with all data, or with Category 3-5 tropical cyclones (TCs) removed, either globally or only over the North Atlantic or Western Pacific basin. We then test these versions of FourCastNet on 2018-2023 Category 5 TCs (gray swans). All versions yield similar accuracy for global weather, but the one trained without Category 3-5 TCs cannot accurately forecast Category 5 TCs, indicating that these models cannot extrapolate from weaker storms. The versions trained without Category 3-5 TCs in one basin show some skill forecasting Category 5 TCs in that basin, suggesting that FourCastNet can generalize across tropical basins. This is encouraging and surprising because regional information is implicitly encoded in inputs. Given that current state-of-the-art AI weather and climate models have similar learning strategies, we expect our findings to apply to other models. Other types of weather extremes need to be similarly investigated. Our work demonstrates that novel learning strategies are needed for AI models to reliably provide early warning or estimated statistics for the rarest, most impactful TCs, and, possibly, other weather extremes.
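The training-set construction described above can be sketched in a few lines. This is a minimal illustration, not the authors' pipeline: it assumes the 988 hPa tropical minimum-mslp threshold and the 30°S–30°N band quoted in the Figure 1 caption, and it represents each sample as a hypothetical list of `(latitude_deg, mslp_hpa)` grid points. The names `tropical_min_mslp` and `split_training_set` are made up for this example.

```python
from typing import List, Sequence, Tuple

# Assumed values, taken from the Figure 1 caption rather than from any released code.
THRESHOLD_HPA = 988.0          # 25th percentile of tropical minimum mslp
TROPICS_DEG = (-30.0, 30.0)    # 30°S to 30°N

# Hypothetical sample layout: one (latitude_deg, mslp_hpa) pair per grid point.
Sample = Sequence[Tuple[float, float]]


def tropical_min_mslp(sample: Sample) -> float:
    """Lowest mslp among grid points inside the tropical band."""
    south, north = TROPICS_DEG
    values = [mslp for lat, mslp in sample if south <= lat <= north]
    # A sample with no tropical grid points can never trigger removal.
    return min(values) if values else float("inf")


def split_training_set(
    samples: Sequence[Sample], threshold: float = THRESHOLD_HPA
) -> Tuple[List[Sample], List[Sample]]:
    """Return (kept, removed); a sample is removed if any tropical mslp is below threshold."""
    kept: List[Sample] = []
    removed: List[Sample] = []
    for sample in samples:
        if tropical_min_mslp(sample) < threshold:
            removed.append(sample)
        else:
            kept.append(sample)
    return kept, removed
```

A basin-restricted variant (noWP or noNA) would add a longitude test to the band check. The basin boundaries are not given in this summary, so they are left out here.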

Paper Structure

This paper contains 16 sections, 2 equations, 10 figures, 1 table.

Figures (10)

  • Figure 1: Schematic overview of this study. a) Training of five versions of FourCastNet. The panel depicts the histogram of minimum mslp in the tropics (30°S–30°N) in the training set (ERA5, 1979-2015). Note that a lower mslp corresponds to a stronger TC. Vertical lines indicate the 5th and 25th percentiles, which are 970 and 988 hPa, respectively. For FourCastNet-Full, the full training dataset is utilized. For FourCastNet-noTC, samples with instances of mslp below $988.0$ hPa anywhere in the tropics are removed from the training set. FourCastNet-Rand uses a training set of the same size and seasonal distribution as noTC but with samples removed randomly (while ensuring that samples with mslp $<988.0$ hPa are retained). Two additional models are also trained for which samples below 988 hPa only over the tropical Western Pacific (noWP) or tropical North Atlantic (noNA) basin are removed. For each training set, five independent versions (realizations) are trained from different random weight/bias initializations to account for model uncertainty. b) Testing of the five models. The forecast skill of each trained model is evaluated for TCs with mslp below 970 hPa (Category 5) in the test set. The right panels provide an example of the forecast results for Hurricane Lee (2023), a Category 5 TC. Shading represents the 25th to 75th percentile range of forecasts, derived from five model realizations and 51 different initial conditions (ICs) provided by an ensemble of data assimilations (EDA) from ECMWF; see Methods and Data.
  • Figure 2: FourCastNet's difficulty in extrapolating to gray swan TCs. Forecasting of all 20 Category 5 TCs from the test set (2018-2023) by three versions of FourCastNet trained on different datasets: FourCastNet-Full (left column), FourCastNet-Rand (middle column), and FourCastNet-noTC (right column). The dashed line shows the critical threshold, the 25th percentile of minimum mslp (roughly a Category 3 TC), used in the noTC training set. All panels show the evolution of the median mslp (solid line) and the inter-quartile range from the 25th to the 75th percentile (shading) over all 20 Category 5 TCs, 5 realizations of each trained model, and 51 perturbed initial conditions from EDA (5100 forecasts). Shading for ERA5 is over the 20 TCs. Forecasts are initialized one day before each TC reached the critical threshold (weak phase, top row) or one day after the TC reached this threshold (strong phase, bottom row). The latter initial conditions are out-of-distribution. As an additional note, detailed analysis shows that none of the ensemble members in the FourCastNet-noTC forecasts reached the observed lowest mslp values. Although a few members' mslp reached 970 hPa, this occurred because these members transitioned to an unstable state that eventually led to blow-up, rather than capturing realistic intensification of the storm.
  • Figure 3: Extratropical cyclones and TCs exhibit different dynamical behavior. a) Joint PDF of mslp and 10-meter winds in the tropics (30°S–30°N) in the Full training set. b) Probability density of 10-meter winds in the tropics in the Full training set, conditioned on the mslp threshold. c) Similar to (a), but for the midlatitudes (40°–60°N) of the noTC training set. d) Similar to (b), but for the midlatitudes of the noTC training set.
  • Figure 4: FourCastNet generalizes across tropical regions for dynamically similar events. a) Comparison of the forecast skill of FourCastNet-noWP against other models for Category 5 TCs (from the test set) in the Western Pacific, initialized at the TC's weak phase. b) As in (a), but for TCs in the North Atlantic basin. c)-d) As in (a)-(b), but initialized at the strong phase of the TCs. Solid lines and shading are as in Figure 2.
  • Figure 5: Lack of physical consistency in the forecasts. a) Gradient-wind balance in ERA5. Radial profiles of azimuthal wind and the gradient wind derived from the gradient-wind balance equation at 500 hPa for all Category 5 TCs in the test set (2018-2023) in their weak phase. b) As in (a), but for FourCastNet-Full's forecasts. c) As in (b), but for FourCastNet-noTC's forecasts. The bottom row is the same as the top row, but for the strong phase of Category 5 TCs. The shading indicates the 25th to 75th percentile range across the 20 Category 5 TCs in the test set (left panel). In the middle and right panels, the shading is over the 20 TCs, 5 realizations, and 51 perturbed ICs.
  • ...and 5 more figures
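The gradient-wind diagnostic named in the Figure 5 caption can be illustrated with a short sketch. The paper's own equation is not reproduced on this page, so the code assumes the textbook form of gradient-wind balance on a pressure surface, V²/r + fV = ∂Φ/∂r, where V is the azimuthal wind, r the distance from the storm center, f the Coriolis parameter, and Φ the geopotential. The function names and the sample numbers are illustrative only.

```python
import math

OMEGA = 7.2921e-5  # Earth's rotation rate in rad/s


def coriolis(lat_deg: float) -> float:
    """Coriolis parameter f = 2 * Omega * sin(latitude)."""
    return 2.0 * OMEGA * math.sin(math.radians(lat_deg))


def gradient_wind(r_m: float, dphi_dr: float, lat_deg: float) -> float:
    """Cyclonic gradient wind speed (m/s) from V**2 / r + |f| * V = dPhi/dr.

    r_m is the radius in meters and dphi_dr the radial geopotential gradient
    in m/s^2, assumed non-negative (geopotential rising away from the center).
    Using |f| makes the returned speed the cyclonic root in either hemisphere.
    """
    half_fr = 0.5 * abs(coriolis(lat_deg)) * r_m
    # Positive root of the quadratic V**2 + f*r*V - r*dPhi/dr = 0.
    return -half_fr + math.sqrt(half_fr ** 2 + r_m * dphi_dr)
```

Comparing this balanced wind with the azimuthal wind actually present in a forecast, radius by radius, is the kind of consistency check the caption describes. A forecast whose winds depart strongly from the balanced value is not dynamically consistent with its own mass field.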