Validating Deep Learning Weather Forecast Models on Recent High-Impact Extreme Events

Olivier C. Pasche, Jonathan Wider, Zhongwei Zhang, Jakob Zscheischler, Sebastian Engelke

TL;DR

This paper tackles the challenge of evaluating deep-learning weather forecast models on recent high-impact extreme events, where standard benchmark metrics may miss tail risks. It compares three ML models (GraphCast, PanguWeather, and FourCastNet) against ECMWF's HRES in three case studies: the 2021 Pacific Northwest (PNW) heatwave, the 2023 South Asian humid heatwave, and the 2021 North American winter storm. The ML models are trained on ERA5, with ERA5 and HRES-fc0 serving as ground truth, and forecasts are scored with impact metrics such as the heat index $HI$ and the wind chill index $T_{wc}$. The ML models match HRES locally during the PNW heatwave but underperform when errors are aggregated over space and time, and they forecast the winter storm substantially better than HRES. The humid heatwave is harder to assess because the ML forecasts lack surface humidity; with relative humidity at 1000 hPa as a substitute, they underestimate the highest danger levels over Bangladesh. The study argues that case-study-driven, impact-centric evaluation reveals model strengths and gaps, clarifies which data and variables are needed, and can inform the development of more reliable ML-based weather forecasts.

Abstract

The forecast accuracy of machine learning (ML) weather prediction models is improving rapidly, leading many to speak of a "second revolution in weather forecasting". With numerous methods being developed and limited physical guarantees offered by ML models, there is a critical need for a comprehensive evaluation of these emerging techniques. While this need has been partly fulfilled by benchmark datasets, they provide little information on rare and impactful extreme events or on compound impact metrics, for which model accuracy might degrade due to misrepresented dependencies between variables. To address these issues, we compare ML weather prediction models (GraphCast, PanguWeather, and FourCastNet) and ECMWF's high-resolution forecast system (HRES) in three case studies: the 2021 Pacific Northwest heatwave, the 2023 South Asian humid heatwave, and the North American winter storm in 2021. We find that ML weather prediction models locally achieve similar accuracy to HRES on the record-shattering Pacific Northwest heatwave but underperform when aggregated over space and time. However, they forecast the compound winter storm substantially better. We also highlight structural differences in how the errors of HRES and the ML models build up to that event. The ML forecasts lack important variables for a detailed assessment of the health risks of the 2023 humid heatwave. Using a possible substitute variable, prediction errors show spatial patterns with the highest danger levels over Bangladesh being underestimated by the ML models. Generally, case-study-driven, impact-centric evaluation can complement existing research, increase public trust, and aid in developing reliable ML weather prediction models.

Paper Structure (24 sections, 10 equations, 19 figures, 5 tables)


Figures (19)

  • Figure 1: Magnitudes of the three events analyzed in this paper. (A) 2021 Pacific Northwest heatwave. Shown is the 2 m temperature anomaly averaged over 27--29 June 2021, the peak of the heatwave. (B) 2023 South Asian humid heatwave. Shown is the category of maximum daily Heat Index $HI$, as defined in \ref{['ss:shape_files']}, averaged over 17--20 April 2023 in India and Bangladesh. (C) 2021 North American winter storm. Shown is the wind chill index $T_{wc}$, as defined in \ref{['ss:na_winterstorm']}, at 12:00 UTC on 15 February 2021.
  • Figure 2: Panels (A1) to (D3): Predictability barrier plots for the grid cells closest to major cities affected by the 2021 heatwave. For HRES, HRES-fc0 is used as ground truth; for the ML models, ERA5 is used instead. In the color bar, $D_5$ and $D_{10}$ indicate long-term multi-year average HRES 5-day and 10-day prediction errors. For the computation of the $\mathrm{RMSE}$, $D_5$, and $D_{10}$, see \ref{['ss:rmse']}. Numerical values for $D_5$ and $D_{10}$ are given in \ref{['tab:rmses']}. Panels (E1) to (E3): Time series of daily maximum $T_{2m}$ for the data sets used as ground truth.
  • Figure 3: Evolution of the $T_{2m}$ prediction RMSE with lead time for the three ML models and HRES in the event region during the peak of the heatwave (June 27--29, 2021, left) compared to summer 2022 as a baseline (June 20--July 10, right). Observations in the considered box region, $45^\circ$--$52^\circ$N, $119^\circ$--$123^\circ$W, are weighted to correct for differences in grid-cell area. The ML models use 06:00/18:00 UTC initial conditions and evaluation times only, and the HRES forecasts use the mixed initialization described in \ref{['ss:init-time']} after a lead time of 3.75 days (dotted line).
  • Figure 4: Error of the ${HI}$ prediction, for the time step of each day during which $HI$ peaked in the ground truth data set, averaged over April 17--20, 2023. For all forecasting methods and ground truth data sets, ${HI}$ is computed using ${RH}_{1000\,\mathrm{hPa}}$ rather than the value at the surface.
  • Figure 5: Proportion of area in study region with given mean daily maximum heat index during April 17–20, 2023, computed using area-weighted kernel density estimation. Shaded areas in the background indicate threat levels (see \ref{['ss:heat-index']}). Light gray to dark gray: low risk, caution, extreme caution, danger, extreme danger. Compared are distributions resulting from forecasts initialized 6 days prior to the start of the event and different ground truths: ERA5 and HRES-fc0, each in two versions of computing the heat index either using $RH_{sfc}$ or using the substitute $RH_{1000\,\mathrm{hPa}}$. For HRES forecasts, we show versions computed with $RH_{1000\,\mathrm{hPa}}$ and $RH_{sfc}$ as well.
  • ...and 14 more figures
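
The Figure 3 caption states that observations in the box region are weighted to correct for differences in grid-cell area. The following is a minimal sketch of one common way to do this on a regular latitude-longitude grid, with weights proportional to the cosine of latitude. The exact weighting used in the paper is not given in this excerpt, so the cosine weighting, the function name `area_weighted_rmse`, the 0.25° grid spacing, and the synthetic data are all assumptions made for illustration.

```python
import numpy as np


def area_weighted_rmse(forecast, truth, lats_deg):
    """RMSE over a lat-lon box with cos(latitude) weights (assumed weighting).

    forecast, truth: arrays of shape (n_lat, n_lon), e.g. 2 m temperature in K.
    lats_deg: 1-D array of grid-cell centre latitudes in degrees, length n_lat.
    """
    forecast = np.asarray(forecast, dtype=float)
    truth = np.asarray(truth, dtype=float)
    # On a regular lat-lon grid, cell area is proportional to cos(latitude).
    weights = np.cos(np.deg2rad(np.asarray(lats_deg, dtype=float)))
    weights_2d = np.broadcast_to(weights[:, None], forecast.shape)
    squared_error = (forecast - truth) ** 2
    return float(np.sqrt(np.sum(weights_2d * squared_error) / np.sum(weights_2d)))


# Synthetic example on the box from the caption (45-52N, 119-123W).
# The 0.25 degree spacing and the random "truth" field are illustrative only.
lats = np.arange(45.0, 52.25, 0.25)
lons = np.arange(-123.0, -118.75, 0.25)
rng = np.random.default_rng(0)
truth = 300.0 + rng.normal(0.0, 2.0, size=(lats.size, lons.size))
forecast = truth + 1.5  # constant warm bias of 1.5 K

print(area_weighted_rmse(forecast, truth, lats))  # approximately 1.5
```

With a spatially constant error the weights cancel, so the weighted and unweighted RMSE agree. They differ once errors vary with latitude: an error in the southernmost row of the box contributes more than the same error in the northernmost row, because the southern cells cover more area.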