WeatherBench 2: A benchmark for the next generation of data-driven global weather models

Stephan Rasp; Stephan Hoyer; Alexander Merose; Ian Langmore; Peter Battaglia; Tyler Russel; Alvaro Sanchez-Gonzalez; Vivian Yang; Rob Carver; Shreya Agrawal; Matthew Chantry; Zied Ben Bouallegue; Peter Dueben; Carla Bromberg; Jared Sisk; Luke Barrington; Aaron Bell; Fei Sha

TL;DR

WeatherBench 2 advances data-driven global weather forecasting by delivering an open, end-to-end benchmark aligned with operational verification practices. It combines ERA5-based ground truth, IFS-based baselines, and a diverse set of AI models (e.g., GraphCast, Pangu-Weather, NeuralGCM) and evaluates them with deterministic metrics (RMSE, ACC, Bias, SEEPS), probabilistic metrics (CRPS, spread-skill), and energy spectra, all on a common 1.5° evaluation grid. The framework emphasizes probabilistic forecasting, fair cross-model comparisons, and transparency through open data and code, while acknowledging ERA5 limitations and the need for post-processing and extremes evaluation. The results provide a snapshot of current state-of-the-art performance and establish a path toward reproducible, community-driven progress in data-driven weather forecasting.

Abstract

WeatherBench 2 is an update to the global, medium-range (1-14 day) weather forecasting benchmark proposed by Rasp et al. (2020), designed to accelerate progress in data-driven weather modeling. WeatherBench 2 consists of an open-source evaluation framework; publicly available training, ground truth, and baseline data; and a continuously updated website with the latest metrics and state-of-the-art models: https://sites.research.google/weatherbench. This paper describes the design principles of the evaluation framework and presents results for current state-of-the-art physical and data-driven weather models. The metrics are based on established practices for evaluating weather forecasts at leading operational weather centers. We define a set of headline scores to provide an overview of model performance. We also discuss caveats in the current evaluation setup and challenges for the future of data-driven weather forecasting.

Paper Structure (43 sections, 16 equations, 21 figures, 3 tables)

Introduction
Design decisions for WeatherBench 2
Data, baselines and data-driven models
ERA5
ERA5 forecasts
Climatology
IFS HRES
IFS HRES Initial Conditions
IFS ENS
IFS ENS Mean
Keisler (2022) Graph Neural Network
Pangu-Weather
Pangu-Weather (operational)
GraphCast
GraphCast (operational)
...and 28 more sections

Figures (21)

Figure 1: Deterministic headline scorecards for upper-level variables. Values show absolute RMSE. Colors denote % difference to the IFS HRES baseline.
Figure 2: Deterministic headline scorecards for surface variables. Values show absolute RMSE, with the exception of precipitation which shows SEEPS (evaluated against ERA5 in all cases). Colors denote % difference to the IFS HRES baseline.
Figure 3: Probabilistic headline scorecards for upper-level variables. Values show absolute CRPS. Colors denote % difference to the IFS ENS baseline.
Figure 4: Global RMSE (SEEPS for TP24h) for headline variables for the year 2020. Note that for TP24h, IFS HRES and IFS ENS (mean) are evaluated against ERA5, since no precipitation accumulations are available for the analysis. Not all models/datasets have all variables available.
Figure 5: Global RMSE/SEEPS % difference compared to IFS HRES for headline variables for the year 2020. Negative values indicate lower RMSE. Note that for TP24h, IFS HRES and IFS ENS (mean) are evaluated against ERA5, since no precipitation accumulations are available for the analysis, and the metric is SEEPS (in this case not 1-SEEPS, to keep a consistent orientation with the other relative plots). Not all models/datasets have all variables available.
...and 16 more figures
