On the Predictive Skill of Artificial Intelligence-based Weather Models for Extreme Events using Uncertainty Quantification

Rodrigo Almeida, Noelia Otero, Miguel-Ángel Fernández-Torres, Jackie Ma

TL;DR

AI-based weather forecasting struggles with uncertainty for extremes. This paper evaluates three deterministic AIWP models (FuXi, GraphCast, SFNO) under initial-condition perturbations to form 50-member ensembles for the 2022 Pakistan floods and the China heatwave, benchmarked against ERA5 and ENS/AIFSENS. Flow-dependent perturbations, especially Huge Ensembles (HENS), improve ensemble realism and probabilistic skill (ROCSS, CRPS) relative to Gaussian perturbations, narrowing the gap with NWP but not closing it. Temperature extremes are more reliably captured than precipitation, highlighting limits tied to subgrid physics. The findings motivate hybrid strategies that integrate flow-dependent perturbations with latent-space uncertainty modeling to enable more trustworthy AI-driven early warnings.

Abstract

Accurate prediction of extreme weather events remains a major challenge for artificial intelligence-based weather prediction systems. While deterministic models such as FuXi, GraphCast, and SFNO have achieved competitive forecast skill relative to numerical weather prediction, their ability to represent uncertainty and capture extremes is still limited. This study investigates how state-of-the-art deterministic artificial intelligence-based models respond to initial-condition perturbations and evaluates the resulting ensembles in forecasting extremes. Using three perturbation strategies (Gaussian noise, Hemispheric Centered Bred Vectors, and Huge Ensembles), we generate 50-member ensembles for two major events in August 2022: the Pakistan floods and the China heatwave. Ensemble skill is assessed against ERA5 and compared with IFS ENS and the probabilistic AIFSENS model using deterministic and probabilistic metrics. Results show that flow-dependent perturbations produce the most realistic ensemble spread and highest probabilistic skill, narrowing but not closing the performance gap with numerical weather prediction ensembles. Across variables, artificial intelligence-based weather models capture temperature extremes more effectively than precipitation. These findings demonstrate that input perturbations can extend deterministic models toward probabilistic forecasting, paving the way for approaches that combine flow-dependent perturbations with generative or latent-space uncertainty modeling for reliable artificial intelligence-driven early warning systems.
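To make the ensemble idea concrete, the following is a minimal sketch, not the authors' pipeline, of the simplest of the three strategies (Gaussian noise) together with the standard empirical estimator of the CRPS for a finite ensemble. The function names, the noise scale `sigma`, and the toy field shapes are assumptions made for illustration; the forecast model step that would normally sit between perturbation and scoring is omitted.

```python
import numpy as np

def gaussian_ensemble(initial_state, n_members=50, sigma=0.1, seed=0):
    """Perturb one initial state with i.i.d. Gaussian noise.

    sigma is a hypothetical amplitude, not a value taken from the paper.
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(0.0, sigma, size=(n_members,) + initial_state.shape)
    return initial_state[None, ...] + noise

def ensemble_crps(members, observation):
    """Empirical CRPS per grid point: E|X - y| - 0.5 * E|X - X'|."""
    members = np.asarray(members, dtype=float)
    obs = np.asarray(observation, dtype=float)
    # Mean absolute error of the members against the observation.
    error_term = np.abs(members - obs[None, ...]).mean(axis=0)
    # Mean absolute difference between all pairs of members (spread).
    pairwise = np.abs(members[:, None, ...] - members[None, :, ...])
    spread_term = 0.5 * pairwise.mean(axis=(0, 1))
    return error_term - spread_term

# Toy usage: a biased initial state scored against a zero "analysis" field.
truth = np.zeros((4, 8))
members = gaussian_ensemble(truth + 0.5)
mean_crps = ensemble_crps(members, truth).mean()
```

Lower CRPS is better. For a single-member ensemble the score reduces to the absolute error, which is why it is often used to compare deterministic and ensemble forecasts on one scale.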

Paper Structure

This paper contains 22 sections, 5 equations, 12 figures, 1 table.

Figures (12)

  • Figure 1: Overview of the study methodology, including the initial condition perturbation, forecasting with AIWP models, and evaluation of results stages.
  • Figure 2: Hemispheric Centered Bred Vector (HCBV) perturbation method. $\Delta z500$ represents correlated spherical Gaussian noise added to the 500 hPa geopotential variable. $h$ represents the norm (size) of the perturbation, computed separately for the northern and southern hemispheres and interpolated in the tropics. $d$ represents the integration depth, that is, the number of recursive cycles from which the final perturbation is computed. The perturbation is additionally centered, so the perturbation vector is alternately added to and subtracted from the initial conditions. Diagram adapted from bano-medinaCalibratedEnsemblesNeural2025.
  • Figure 3: Analysis of the August 2022 Pakistan extreme precipitation and China heatwave, based on ERA5. The values displayed (in the spatial and temporal dimensions) correspond to the exceedance of the 99th percentile of the ERA5 1990-2020 climatology for daily total precipitation and daily maximum temperature. The geographical bounds of each case study are also shown. A maximum precipitation exceedance of 200 mm is visible in Pakistan, while a sustained 10 K exceedance throughout August is observed in China, reinforcing the extreme nature of both events.
  • Figure 4: ROCSS values at the 99th percentile of the ERA5 1990-2020 climatology for daily accumulated precipitation ($TP_{24h}$) and maximum daily temperature ($T2M_{24h}$), across 10 lead times, for the different AIWP models and their ensembles, over the Pakistan and China regions in August 2022. The symbols under the values mark differences from ENS that are significantly higher or lower ($p<0.05$), while the symbol size represents the magnitude of the difference. Higher ROCSS values indicate better performance on the extremes. HENS perturbations achieve the highest ROCSS across all model ensembles.
  • Figure 5: Daily accumulated precipitation spread for the different ensemble models (ENS, AIWPs) over Pakistan on 18 August 2022 for a 3-day lead time forecast. The ERA5 ground-truth spatial distribution of daily precipitation is shown in the top left corner. Daily average geopotential height at 500 hPa is also displayed as contour lines.
  • ...and 7 more figures
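
The percentile-based ROCSS reported in Figure 4 can be illustrated with a short sketch. It assumes the common definition ROCSS = 2 AUC - 1, where the forecast probability of an extreme is the fraction of ensemble members exceeding a climatological threshold; the paper's exact formulation may differ. All function names are hypothetical, and a single pooled threshold is used here for brevity, whereas a grid-point-wise threshold would be the more usual choice.

```python
import numpy as np

def exceedance_probability(members, threshold):
    """Fraction of ensemble members above the threshold, per sample."""
    return (np.asarray(members) > threshold).mean(axis=0)

def roc_auc(probabilities, events):
    """AUC via the Mann-Whitney rank statistic, with ties counted as 0.5."""
    p = np.asarray(probabilities, dtype=float).ravel()
    e = np.asarray(events, dtype=bool).ravel()
    pos, neg = p[e], p[~e]
    if pos.size == 0 or neg.size == 0:
        return np.nan  # AUC is undefined unless both outcomes occur
    greater = (pos[:, None] > neg[None, :]).sum()
    ties = (pos[:, None] == neg[None, :]).sum()
    return (greater + 0.5 * ties) / (pos.size * neg.size)

def rocss(members, observations, climatology, percentile=99.0):
    """ROC skill score for exceeding a climatological percentile."""
    threshold = np.percentile(climatology, percentile)
    probs = exceedance_probability(members, threshold)
    events = np.asarray(observations) > threshold
    # 1 is a perfect forecast, 0 matches a no-skill (random) forecast.
    return 2.0 * roc_auc(probs, events) - 1.0

# Toy usage with synthetic data (not from the paper).
climatology = np.arange(1000.0)
observations = np.array([0.0, 500.0, 995.0, 999.0])
perfect_members = np.tile(observations, (50, 1))
score = rocss(perfect_members, observations, climatology)
```

Because the score depends only on how well the ensemble ranks events above non-events, it measures discrimination rather than calibration, which is one reason it is usually read alongside a metric such as the CRPS.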