Evaluating Extreme Precipitation Forecasts: A Threshold-Weighted, Spatial Verification Approach for Comparing an AI Weather Prediction Model Against a High-Resolution NWP Model

Nicholas Loveday; Tracy Hertneky

TL;DR

The paper develops a spatial verification framework that combines the High-Resolution Assessment (HiRA) neighbourhood approach with threshold-weighted scoring (twCRPS), so that AIWP forecasts can be compared against a high-resolution NWP model without re-gridding. Using 32 months of ASOS observations and forecasts from GraphCast-GFS and HRRR, it shows that model rankings depend on neighbourhood size and lead time: at matched neighbourhood sizes, HRRR outperforms GraphCast-GFS on extreme precipitation only at short lead times. It also extends the analysis to discrimination ability via a CORP-like decomposition, where HRRR discriminates heavy precipitation better at short lead times and GraphCast-GFS is slightly better from a lead time of 24 hours onwards. The framework is practical for operational use, avoids forecaster hedging, and is extensible to multivariate and ensemble applications, offering a tool for assessing extreme-weather forecasts across model resolutions.

Abstract

Recent advances in AI-based weather prediction have led to the development of artificial intelligence weather prediction (AIWP) models with competitive forecast skill compared to traditional numerical weather prediction (NWP) models, but with substantially reduced computational cost. There is a strong need for appropriate methods to evaluate their ability to predict extreme weather events, particularly when spatial coherence is important and grid resolutions differ between models. We introduce a verification framework that combines spatial verification methods and proper scoring rules. Specifically, the framework extends the High-Resolution Assessment (HiRA) approach with threshold-weighted scoring rules. It enables user-oriented evaluation consistent with how forecasts may be interpreted by operational meteorologists or used in simple post-processing systems. The method supports targeted evaluation of extreme events by allowing flexible weighting of the relative importance of different decision thresholds. We demonstrate this framework by evaluating 32 months of precipitation forecasts from an AIWP model and a high-resolution NWP model. Our results show that model rankings are sensitive to the choice of neighbourhood size. Increasing the neighbourhood size has a greater impact on scores evaluating extreme-event performance for the high-resolution NWP model than for the AIWP model. At equivalent neighbourhood sizes, the high-resolution NWP model outperformed the AIWP model in predicting extreme precipitation events only at short lead times. We also demonstrate how this approach can be extended to evaluate discrimination ability in predicting heavy precipitation. We find that the high-resolution NWP model had superior discrimination ability at short lead times, while the AIWP model had slightly better discrimination ability from a lead time of 24 hours onwards.
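
To make the combination of neighbourhood processing and threshold weighting concrete, the following is a minimal sketch and not the authors' implementation. It assumes a rectangular threshold weight (full weight at or above a threshold, zero below), a square neighbourhood clipped at the domain edge, and that the grid values around a station are treated as an equally weighted pseudo-ensemble. The function names and arguments are hypothetical. The score uses the standard ensemble form of the twCRPS with the chaining function v(z) = max(z, threshold).

```python
import numpy as np


def neighbourhood_members(field, row, col, half_width):
    """Return grid values in a square window centred on (row, col).

    Hypothetical helper: the window is clipped at the domain edges, and
    half_width=0 returns only the nearest grid point.
    """
    r0 = max(row - half_width, 0)
    r1 = min(row + half_width + 1, field.shape[0])
    c0 = max(col - half_width, 0)
    c1 = min(col + half_width + 1, field.shape[1])
    return field[r0:r1, c0:c1].ravel()


def tw_crps_ensemble(members, obs, threshold):
    """Threshold-weighted CRPS of an ensemble for one observation.

    Assumes the weight w(z) = 1 for z >= threshold and 0 otherwise, whose
    chaining function is v(z) = max(z, threshold). The score is
    mean|v(x_i) - v(y)| - 0.5 * mean|v(x_i) - v(x_j)|.
    """
    members = np.asarray(members, dtype=float)
    v_members = np.maximum(members, threshold)
    v_obs = max(float(obs), threshold)
    # Distance between the transformed members and the transformed observation.
    term_obs = np.mean(np.abs(v_members - v_obs))
    # Mean absolute difference over all ordered member pairs (including i == j).
    term_spread = np.mean(np.abs(v_members[:, None] - v_members[None, :]))
    return term_obs - 0.5 * term_spread


# Usage: score a 3x3 neighbourhood of a synthetic field against a station value.
field = np.arange(25, dtype=float).reshape(5, 5)
members = neighbourhood_members(field, row=2, col=2, half_width=1)
score = tw_crps_ensemble(members, obs=13.0, threshold=12.0)
```

With a threshold far below all values the score reduces to the ordinary ensemble CRPS, and when every member and the observation fall below the threshold the score is zero, which is why only behaviour in the extreme range is rewarded or penalised. Widening `half_width` enlarges the pseudo-ensemble, which is the neighbourhood-size sensitivity the abstract refers to.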
