A Practical Probabilistic Benchmark for AI Weather Models

Noah D. Brenowitz; Yair Cohen; Jaideep Pathak; Ankur Mahesh; Boris Bonev; Thorsten Kurth; Dale R. Durran; Peter Harrington; Michael S. Pritchard

TL;DR

The paper tackles the challenge of fairly benchmarking the probabilistic skill of AI weather forecasts by introducing lagged ensemble forecasting (LEF), a parameter-free approach that builds probabilistic ensembles from deterministic hindcasts. Applying LEF, the authors compare GraphCast, Pangu-Weather (Pangu), and a conventional IFS baseline, revealing that GraphCast and Pangu achieve similar probabilistic performance despite differing deterministic strengths, and that multi-step training can harm ensemble calibration. Ablation studies with SFNO show that long-lead-time autoregressive training reduces ensemble spread without proportional gains in probabilistic skill, while effective resolution, controlled via the scale factor, modulates dispersion. The LEF framework provides a practical, scalable tool for end-to-end probabilistic evaluation and cross-model comparisons, supported by open-source software and shared data baselines.

Abstract

Since the weather is chaotic, forecasts aim to predict the distribution of future states rather than make a single prediction. Recently, multiple data-driven weather models have emerged claiming breakthroughs in skill. However, these have mostly been benchmarked using deterministic skill scores, and little is known about their probabilistic skill. Unfortunately, it is hard to fairly compare AI weather models in a probabilistic sense, since variations in choice of ensemble initialization, definition of state, and noise injection methodology become confounding. Moreover, even obtaining ensemble forecast baselines is a substantial engineering challenge given the data volumes involved. We sidestep both problems by applying a decades-old idea -- lagged ensembles -- whereby an ensemble can be constructed from a moderately-sized library of deterministic forecasts. This allows the first parameter-free intercomparison of leading AI weather models' probabilistic skill against an operational baseline. The results reveal that two leading AI weather models, i.e., GraphCast and Pangu, are tied on the probabilistic CRPS metric even though the former outperforms the latter in deterministic scoring. We also reveal how multiple time-step loss functions, which many data-driven weather models have employed, are counter-productive: they improve deterministic metrics at the cost of increased dissipation, deteriorating probabilistic skill. This is confirmed through ablations applied to a spherical Fourier Neural Operator (SFNO) approach to AI weather forecasting. Separate SFNO ablations modulating effective resolution reveal it has a useful effect on ensemble dispersion relevant to achieving good ensemble calibration. We hope these and forthcoming insights from lagged ensembles can help guide the development of AI weather forecasts and have thus shared the diagnostic code.
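
As a rough illustration of how a lagged ensemble can be assembled from a library of deterministic forecasts, the sketch below gathers every forecast that verifies at the same valid time but was started at a different initial time. It is not the authors' shared diagnostic code. The dictionary layout, the names toy_library and lagged_ensemble, the 12-hour lag spacing, and the toy forecast values are all assumptions made for illustration. The symmetric choice of lags (M before and M after the central initial time, giving R = 2M + 1 members) is inferred from the ensemble size quoted in the Figure 3 caption.

```python
from statistics import mean


def toy_library(init_times, leads):
    """Hypothetical forecast library keyed by (init_time, lead_time) in hours.

    Toy model: the truth at a valid time equals the valid time itself,
    and the forecast error grows linearly with lead time.
    """
    return {(t0, tau): float(t0 + tau) + 0.01 * tau
            for t0 in init_times for tau in leads}


def lagged_ensemble(forecasts, valid_time, lead, n_lags, lag_step):
    """Return the 2 * n_lags + 1 forecasts that all verify at valid_time.

    Member j is initialized at (valid_time - lead) + j * lag_step, so its
    own lead time is lead - j * lag_step. The symmetric range of j is an
    assumption inferred from R = 2M + 1.
    """
    members = []
    for j in range(-n_lags, n_lags + 1):
        init = valid_time - lead + j * lag_step
        members.append(forecasts[(init, valid_time - init)])
    return members


# Deterministic forecasts started every 12 h, saved every 12 h of lead time.
library = toy_library(init_times=range(0, 241, 12), leads=range(0, 241, 12))

# Five-member lagged ensemble (M = 2) centered on a 48-hour lead.
members = lagged_ensemble(library, valid_time=120, lead=48, n_lags=2, lag_step=12)
ensemble_mean = mean(members)
```

No perturbation or noise parameter appears anywhere in this construction, which is the sense in which the approach is parameter-free: the only choices are the number of lags and the spacing of initial times already present in the forecast library.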

Paper Structure (20 sections, 11 equations, 6 figures, 1 table)

Introduction
Lagged Ensemble Benchmark
Results
Comparing a lagged and operational ensemble
Comparing data-driven and NWP models with the same ensemble method
The effect of long-lead time training
Sensitivity to effective resolution
Discussion and Conclusions
Software
Data
Model Weights
Supplemental Information
Lorenz 1963 Dynamical System
Numerical models
Data-driven models
...and 5 more sections

Figures (6)

Figure 1: Overview of lagged ensemble forecasting. (a) A schematic of the method. Each ensemble member (color) is initialized at a different initial time (dots). The true time series (-) and the lagged ensemble average (- -) are also shown. (b) Comparison between the global mean CRPS of 500 hPa height forecasts from a lagged ensemble and the IFS operational ensemble at a lead time of 5 days. Each dot shows a different initial time. The units of both axes are m^2/s^2.
Figure 2: Z500 and T850 skill comparison for several deterministic forecast systems. From left to right, deterministic RMSE (dRMSE), ensemble RMSE (eRMSE), ensemble spread and CRPS. Ensemble scores are only valid between 2 and 8 days.
Figure 3: Using LEF to explore the effect of fine-tuning. Deterministic RMSE (a), ensemble mean RMSE (b), and CRPS (c) as a function of lead time. Spread as a function of ensemble mean RMSE (d). The spread is multiplied by a factor involving the ensemble size $R=2M+1=9$ (see Appendix \ref{sec:metrics}). The colored lines show a range of fine-tuning steps and the dashed line shows the IFS. The field being scored is temperature at 850 hPa (T850). Unlike Figure \ref{fig:comparisons}, the typical biased estimator of CRPS is used (\ref{eq:crps-biased}).
Figure 4: Sensitivity of the ensemble spread to changes in hyperparameters. (a-d) show results for various fields and differing amounts of autoregressive fine-tuning. (e-h) show the same, but with a varying scale factor. The bias-corrected spread-error ratio is defined by $\text{SER}=\sqrt{\frac{R+1}{R}}\frac{\text{Spread}}{\text{eRMSE}}$.
Figure A1: Same as Figure \ref{fig:schematic}b but for differing numbers of lags $M=1,2,3,4$ and for two channels, u10m and z500. The correlation between the lagged and ENS CRPS is similar in all cases. The lagged CRPS has similar quantitative values to the ENS CRPS for $M=1$.
...and 1 more figures
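
The ensemble metrics named in the captions above can be sketched in a few lines. The spread-error ratio follows the formula in the Figure 4 caption. The two CRPS functions are the standard biased and "fair" ensemble estimators, which differ only in how the member-to-member term is normalized. The paper's own definitions sit in an appendix that is not reproduced here, so the spread definition (square root of the mean unbiased ensemble variance), the function names, and the scalar toy data are assumptions rather than the authors' implementation.

```python
import math
from statistics import mean, variance


def crps_biased(members, obs):
    """Standard biased ensemble CRPS: pairwise term normalized by 2 * R**2."""
    r = len(members)
    skill = sum(abs(x - obs) for x in members) / r
    pairs = sum(abs(a - b) for a in members for b in members)
    return skill - pairs / (2 * r * r)


def crps_fair(members, obs):
    """Fair ensemble CRPS: pairwise term normalized by 2 * R * (R - 1)."""
    r = len(members)
    skill = sum(abs(x - obs) for x in members) / r
    pairs = sum(abs(a - b) for a in members for b in members)
    return skill - pairs / (2 * r * (r - 1))


def spread_error_ratio(ensembles, observations):
    """SER = sqrt((R + 1) / R) * Spread / eRMSE, as in the Figure 4 caption.

    ensembles: one list of R member values per forecast case.
    Spread is ASSUMED to be the square root of the mean unbiased ensemble
    variance; eRMSE is the RMSE of the ensemble mean across cases.
    """
    r = len(ensembles[0])
    spread = math.sqrt(mean([variance(e) for e in ensembles]))
    ermse = math.sqrt(mean([(mean(e) - y) ** 2
                            for e, y in zip(ensembles, observations)]))
    return math.sqrt((r + 1) / r) * spread / ermse


# Toy scalar example: two forecast cases with three members each.
toy_ensembles = [[1.0, 2.0, 3.0], [2.0, 3.0, 4.0]]
toy_observations = [3.0, 2.0]

biased = crps_biased(toy_ensembles[0], 2.0)
fair = crps_fair(toy_ensembles[0], 2.0)
ser = spread_error_ratio(toy_ensembles, toy_observations)
```

For a fixed set of members the fair estimator is never larger than the biased one, because it subtracts a larger pairwise term. The gap shrinks as R grows, so the choice of estimator matters most for small ensembles such as the R = 9 case quoted in the Figure 3 caption.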
