WeatherBench 2: A benchmark for the next generation of data-driven global weather models

Stephan Rasp, Stephan Hoyer, Alexander Merose, Ian Langmore, Peter Battaglia, Tyler Russel, Alvaro Sanchez-Gonzalez, Vivian Yang, Rob Carver, Shreya Agrawal, Matthew Chantry, Zied Ben Bouallegue, Peter Dueben, Carla Bromberg, Jared Sisk, Luke Barrington, Aaron Bell, Fei Sha

TL;DR

WeatherBench 2 advances data-driven global weather forecasting by delivering an open, end-to-end benchmark aligned with operational verification practices. It combines ERA5-based ground truth, IFS-based baselines, and a diverse set of AI models (e.g., GraphCast, Pangu-Weather, NeuralGCM) and evaluates them with deterministic metrics (RMSE, ACC, Bias, SEEPS), probabilistic metrics (CRPS, spread-skill), and energy spectra, all on a common 1.5° evaluation grid. The framework emphasizes probabilistic forecasting, fair cross-model comparisons, and transparency through open data and code, while acknowledging the limitations of ERA5 as ground truth and the need for post-processing and evaluation of extremes. The results provide a snapshot of current state-of-the-art performance and establish a path toward reproducible, community-driven progress in data-driven weather forecasting.

Abstract

WeatherBench 2 is an update to the global, medium-range (1-14 day) weather forecasting benchmark proposed by Rasp et al. (2020), designed to accelerate progress in data-driven weather modeling. WeatherBench 2 consists of an open-source evaluation framework; publicly available training, ground truth, and baseline data; and a continuously updated website with the latest metrics and state-of-the-art models: https://sites.research.google/weatherbench. This paper describes the design principles of the evaluation framework and presents results for current state-of-the-art physical and data-driven weather models. The metrics are based on established practices for evaluating weather forecasts at leading operational weather centers. We define a set of headline scores to provide an overview of model performance. We also discuss caveats in the current evaluation setup and challenges for the future of data-driven weather forecasting.

Paper Structure

This paper contains 43 sections, 16 equations, 21 figures, 3 tables.

Figures (21)

  • Figure 1: Deterministic headline scorecards for upper-level variables. Values show absolute RMSE. Colors denote % difference to the IFS HRES baseline.
  • Figure 2: Deterministic headline scorecards for surface variables. Values show absolute RMSE, with the exception of precipitation which shows SEEPS (evaluated against ERA5 in all cases). Colors denote % difference to the IFS HRES baseline.
  • Figure 3: Probabilistic headline scorecards for upper-level variables. Values show absolute CRPS. Colors denote % difference to the IFS ENS baseline.
  • Figure 4: Global RMSE (SEEPS for TP24h) for headline variables for the year 2020. Note that for TP24h, IFS HRES and IFS ENS (mean) are evaluated against ERA5, since no precipitation accumulations are available for the analysis. Not all models/datasets have all variables available.
  • Figure 5: Global RMSE/SEEPS % difference compared to IFS HRES for headline variables for the year 2020. Negative values indicate lower RMSE. Note that for TP24h, IFS HRES and IFS ENS (mean) are evaluated against ERA5, since no precipitation accumulations are available for the analysis, and the metric is SEEPS (in this case not 1-SEEPS to have a consistent orientation with the other relative plots). Not all models/datasets have all variables available.
  • ...and 16 more figures
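
Several of the captions above report scores as a % difference to a baseline (IFS HRES or IFS ENS), with negative values meaning a lower, and therefore better, error. A minimal sketch of that relative score follows, assuming the conventional definition 100 × (model − baseline) / baseline. The function name and the numbers are hypothetical illustrations; they are not taken from the paper or from the WeatherBench 2 code.

```python
def percent_difference(model_score: float, baseline_score: float) -> float:
    """Relative difference of an error metric with respect to a baseline, in percent.

    Assumes an error-type metric where lower is better (e.g. RMSE, CRPS, SEEPS),
    so a negative result means the model improves on the baseline.
    """
    if baseline_score == 0:
        raise ValueError("baseline_score must be non-zero")
    return 100.0 * (model_score - baseline_score) / baseline_score


# Hypothetical RMSE values, for illustration only (not results from the paper).
baseline_rmse = 100.0  # assumed baseline error
model_rmse = 90.0      # assumed model error

print(percent_difference(model_rmse, baseline_rmse))
```

Under this convention, a model with 10% lower RMSE than the baseline gets a negative value, which matches the orientation described in the Figure 5 caption.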