Evaluating Extreme Precipitation Forecasts: A Threshold-Weighted, Spatial Verification Approach for Comparing an AI Weather Prediction Model Against a High-Resolution NWP Model

Nicholas Loveday, Tracy Hertneky

TL;DR

The paper develops a spatial verification framework that combines the HiRA neighbourhood approach with threshold-weighted scoring rules (twCRPS) to compare AIWP forecasts against a high-resolution NWP model without re-gridding. Using 32 months of ASOS observations and forecasts from GraphCast-GFS and HRRR, it shows that model rankings depend on neighbourhood size and lead time: at matched neighbourhood sizes, HRRR outperforms GraphCast-GFS for extreme precipitation only at short lead times. It also extends the analysis to discrimination ability via a CORP-like decomposition, where HRRR discriminates heavy precipitation better at short lead times and GraphCast-GFS is slightly better from 24 hours onwards. Because the framework is built on proper scoring rules, it does not reward forecaster hedging; it is practical for operational use and extensible to multivariate and ensemble applications.

Abstract

Recent advances in AI-based weather prediction have led to the development of artificial intelligence weather prediction (AIWP) models with competitive forecast skill compared to traditional NWP models, but with substantially reduced computational cost. There is a strong need for appropriate methods to evaluate their ability to predict extreme weather events, particularly when spatial coherence is important and grid resolutions differ between models. We introduce a verification framework that combines spatial verification methods and proper scoring rules. Specifically, the framework extends the High-Resolution Assessment (HiRA) approach with threshold-weighted scoring rules. It enables user-oriented evaluation consistent with how forecasts may be interpreted by operational meteorologists or used in simple post-processing systems. The method supports targeted evaluation of extreme events by allowing flexible weighting of the relative importance of different decision thresholds. We demonstrate this framework by evaluating 32 months of precipitation forecasts from an AIWP model and a high-resolution NWP model. Our results show that model rankings are sensitive to the choice of neighbourhood size. Increasing the neighbourhood size has a greater impact on scores evaluating extreme-event performance for the high-resolution NWP model than for the AIWP model. At equivalent neighbourhood sizes, the high-resolution NWP model outperformed the AIWP model in predicting extreme precipitation events only at short lead times. We also demonstrate how this approach can be extended to evaluate discrimination ability in predicting heavy precipitation. We find that the high-resolution NWP model had superior discrimination ability at short lead times, while the AIWP model had slightly better discrimination ability from a lead time of 24 hours onwards.
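
As a rough illustration of the neighbourhood idea behind HiRA, the sketch below treats the grid values in a window around a station as a pseudo-ensemble and scores them against the station observation with the empirical CRPS. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`neighbourhood_members`, `crps_ensemble`, `hira_crps`), the edge clipping, and the assumption that the station's nearest grid cell is already known are all hypothetical choices made for this example.

```python
import numpy as np


def neighbourhood_members(field, row, col, n_rows, n_cols):
    """Return the grid values in an n_rows x n_cols window centred on (row, col).

    Assumes odd window sizes; the window is clipped at the domain edges
    (an assumption of this sketch, not something stated in the paper).
    """
    half_r, half_c = n_rows // 2, n_cols // 2
    r0, r1 = max(row - half_r, 0), min(row + half_r + 1, field.shape[0])
    c0, c1 = max(col - half_c, 0), min(col + half_c + 1, field.shape[1])
    return field[r0:r1, c0:c1].ravel()


def crps_ensemble(members, obs):
    """Empirical CRPS of an ensemble: E|X - y| - 0.5 * E|X - X'|."""
    members = np.asarray(members, dtype=float).ravel()
    term_obs = np.mean(np.abs(members - obs))
    term_spread = 0.5 * np.mean(np.abs(members[:, None] - members[None, :]))
    return term_obs - term_spread


def hira_crps(field, row, col, n_rows, n_cols, obs):
    """CRPS of the neighbourhood pseudo-ensemble against a point observation."""
    return crps_ensemble(neighbourhood_members(field, row, col, n_rows, n_cols), obs)
```

With a 1×1 neighbourhood the pseudo-ensemble has a single member, so the score reduces to the absolute error of the nearest grid value. Larger windows trade sharpness for tolerance to small displacement errors, which is why the choice of neighbourhood size can change model rankings.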

Paper Structure

This paper contains 16 sections, 8 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: A map of ASOS station locations used in this study.
  • Figure 2: Graphical illustration of the threshold-weighted continuous ranked probability score (twCRPS) with a uniform weight of 1 applied to all thresholds $z > t$ and a weight of 0 applied elsewhere. (a) The solid blue curve shows the forecast cumulative distribution function (CDF) and the dashed orange line represents the Heaviside step function of the observation. The twCRPS is the integrated squared difference between the solid blue curve and the dashed orange line, with the interval of integration ($z > t$) indicated by the green shaded region. (b) The threshold weight function $w(z) = \mathds{1}(z > t)$. (c) The corresponding chaining function $v(z) = \max(z, t)$.
  • Figure 3: (a) Mean CRPS results aggregated across all stations and timesteps. Lower scores are better. (b) Difference between GraphCast-GFS 1$\times$1 and HRRR 1$\times$1 with 99% confidence intervals. (c) Difference between GraphCast-GFS 1$\times$1 and HRRR 7$\times$9 (21$\times$27 km equivalent) with 99% confidence intervals. (d) Difference between GraphCast-GFS 3$\times$3 and HRRR 21$\times$27 (63$\times$81 km equivalent) with 99% confidence intervals. In subfigures b-d, positive values indicate that HRRR performed better than GraphCast-GFS for the specified neighbourhoods.
  • Figure 4: As for Fig. 3 but for the twCRPS with a threshold weight function of $w(z) = \mathds{1}(z>q_{0.99})$.
  • Figure 5: Brier score decomposition of the CRPS within the HiRA framework. Lower scores are better. The left panels (a) and (c) show the mean Brier score for thresholds below 30 mm, while the right panels (b) and (d) show the mean Brier score for thresholds between 30 and 80 mm with a logarithmic vertical axis. Results are shown for 6-hour lead time forecasts in panels (a) and (b), and 30-hour lead time forecasts in panels (c) and (d).
  • ...and 3 more figures
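
The chaining function in the Figure 2 caption suggests a simple way to compute the twCRPS for an ensemble (or a HiRA pseudo-ensemble): apply $v(z) = \max(z, t)$ to both the members and the observation, then compute the ordinary empirical CRPS on the transformed values. The sketch below assumes the weight $w(z) = \mathds{1}(z > t)$ from the caption; the function name `tw_crps_ensemble` and the example numbers are hypothetical and chosen only for illustration.

```python
import numpy as np


def tw_crps_ensemble(members, obs, t):
    """Threshold-weighted CRPS with weight 1(z > t), via the chaining function.

    Computes E|v(X) - v(y)| - 0.5 * E|v(X) - v(X')| with v(z) = max(z, t).
    """
    v_members = np.maximum(np.asarray(members, dtype=float).ravel(), t)
    v_obs = max(float(obs), t)
    term_obs = np.mean(np.abs(v_members - v_obs))
    term_spread = 0.5 * np.mean(np.abs(v_members[:, None] - v_members[None, :]))
    return term_obs - term_spread


# Illustrative values only: three members (mm), a 30 mm observation, t = 10 mm.
example_score = tw_crps_ensemble([0.0, 5.0, 20.0], 30.0, 10.0)
```

When every member and the observation fall at or below $t$, all transformed values equal $t$ and the score is zero, so only behaviour above the threshold contributes. Setting $t$ below all values recovers the unweighted CRPS.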