WeatherBench 2: A benchmark for the next generation of data-driven global weather models
Stephan Rasp, Stephan Hoyer, Alexander Merose, Ian Langmore, Peter Battaglia, Tyler Russel, Alvaro Sanchez-Gonzalez, Vivian Yang, Rob Carver, Shreya Agrawal, Matthew Chantry, Zied Ben Bouallegue, Peter Dueben, Carla Bromberg, Jared Sisk, Luke Barrington, Aaron Bell, Fei Sha
TL;DR
WeatherBench 2 advances data-driven global weather forecasting by delivering an open, end-to-end benchmark aligned with operational verification practices. It combines ERA5-based ground truth, IFS-based baselines, and a diverse set of AI models (e.g., GraphCast, Pangu-Weather, NeuralGCM) and evaluates them with deterministic metrics (RMSE, ACC, bias, SEEPS), probabilistic metrics (CRPS, spread-skill), and energy spectra, all on a common 1.5° evaluation grid. The framework emphasizes probabilistic forecasting, fair cross-model comparisons, and transparency through open data and code, while acknowledging ERA5 limitations and the need for post-processing and extremes evaluation. The results provide a snapshot of current state-of-the-art performance and establish a path toward reproducible, community-driven progress in data-driven weather forecasting.
Abstract
WeatherBench 2 is an update to the global, medium-range (1-14 day) weather forecasting benchmark proposed by Rasp et al. (2020), designed to accelerate progress in data-driven weather modeling. WeatherBench 2 consists of an open-source evaluation framework; publicly available training, ground truth, and baseline data; and a continuously updated website with the latest metrics and state-of-the-art models: https://sites.research.google/weatherbench. This paper describes the design principles of the evaluation framework and presents results for current state-of-the-art physical and data-driven weather models. The metrics are based on established practices for evaluating weather forecasts at leading operational weather centers. We define a set of headline scores to provide an overview of model performance. We also discuss caveats in the current evaluation setup and challenges for the future of data-driven weather forecasting.
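
To make the kind of metric referred to above concrete, the following is a minimal sketch of a latitude-weighted RMSE and an ensemble CRPS on a 1.5° latitude-longitude grid. It is not the WeatherBench 2 implementation: the function names, the synthetic fields, and the simple cosine-latitude weighting are assumptions made for illustration, and the benchmark's own area weighting and CRPS estimator may differ.

```python
import numpy as np


def latitude_weights(lat_deg):
    """Cosine-latitude weights normalized to mean 1 (an assumed weighting)."""
    w = np.cos(np.deg2rad(lat_deg))
    return w / w.mean()


def weighted_rmse(forecast, truth, lat_deg):
    """Latitude-weighted RMSE for fields of shape (lat, lon)."""
    w = latitude_weights(lat_deg)[:, None]  # broadcast weights over longitude
    return float(np.sqrt(np.mean(w * (forecast - truth) ** 2)))


def ensemble_crps(ensemble, truth, lat_deg):
    """Latitude-weighted CRPS for an ensemble of shape (member, lat, lon).

    Uses the plain estimator E|X - y| - 0.5 * E|X - X'|, averaged over
    all member pairs (including a member paired with itself).
    """
    w = latitude_weights(lat_deg)[:, None]
    skill = np.abs(ensemble - truth).mean(axis=0)
    spread = np.abs(ensemble[:, None] - ensemble[None, :]).mean(axis=(0, 1))
    return float(np.mean(w * (skill - 0.5 * spread)))


# Synthetic data on a 1.5 degree grid: 121 latitudes x 240 longitudes.
lat = np.linspace(-90.0, 90.0, 121)
lon = np.arange(0.0, 360.0, 1.5)
rng = np.random.default_rng(0)
truth = rng.normal(size=(lat.size, lon.size))
forecast = truth + 1.0  # deterministic forecast with a constant error
ensemble = truth + rng.normal(size=(4, lat.size, lon.size))  # 4 noisy members

print("weighted RMSE:", weighted_rmse(forecast, truth, lat))
print("ensemble CRPS:", ensemble_crps(ensemble, truth, lat))
```

Because the weights are normalized to mean 1, a spatially uniform error gives an RMSE equal to that error. An ensemble whose members are all identical has zero spread, so its CRPS reduces to the weighted mean absolute error of the single forecast.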
