Enhancing the statistical evaluation of earthquake forecasts -- An application to Italy
Jonas R. Brehmer, Kristof Kraus, Tilmann Gneiting, Marcus Herrmann, Warner Marzocchi
TL;DR
The paper evaluates short-term earthquake forecasts by linking CSEP-style likelihood testing with consistent scoring functions and reliability diagnostics. It develops a nonparametric toolkit based on proper scoring rules, consistent scoring functions for mean forecasts, and Diebold–Mariano tests for inference, complemented by CORP mean-calibration analysis based on isotonic regression to assess reliability. Applied to three short-term forecasting models from the CSEP-Italy experiment and two ensembles thereof, the methods show how the forecasts compare under the Poisson and quadratic scoring functions, reveal a substantial lack of calibration in several models, and provide diagnostics (mean reliability, MCB–DSC decomposition) that can inform model improvement. The approach makes no parametric assumptions, such as Poisson distributions, and also applies to full-distribution (e.g., catalog-based) forecasts.
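To make the scoring-based comparison more concrete, the following is a minimal sketch, not the authors' implementation. It scores two sets of mean forecasts with the Poisson scoring function x - y log(x) and the quadratic scoring function (x - y)^2, then applies a Diebold–Mariano-type test to the score differences. The function names, the toy data, and the Bartlett-weighted (Newey–West-style) variance estimate are assumptions made for illustration; the paper may use a different variance estimator.

```python
import math
from statistics import NormalDist


def poisson_score(x, y):
    """Poisson scoring function x - y*log(x); consistent for the mean (needs x > 0)."""
    return x - y * math.log(x)


def quadratic_score(x, y):
    """Quadratic scoring function (x - y)^2; also consistent for the mean."""
    return (x - y) ** 2


def period_scores(forecasts, observations, score):
    """Sum the cell-wise scores within each forecast period.

    forecasts[t][c] is the forecast expected count for cell c in period t,
    observations[t][c] is the observed count (hypothetical layout).
    """
    return [
        sum(score(x, y) for x, y in zip(f_t, o_t))
        for f_t, o_t in zip(forecasts, observations)
    ]


def diebold_mariano(scores_a, scores_b, max_lag=0):
    """Two-sided Diebold-Mariano-type test on score differences.

    Negative statistics favour model A (lower scores are better).
    max_lag > 0 adds Bartlett-weighted autocovariances, an assumed choice
    meant to account for serial dependence, e.g. from overlapping forecasts.
    """
    d = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(d)
    mean_d = sum(d) / n

    def autocov(k):
        return sum((d[t] - mean_d) * (d[t - k] - mean_d) for t in range(k, n)) / n

    long_run_var = autocov(0) + 2 * sum(
        (1 - k / (max_lag + 1)) * autocov(k) for k in range(1, max_lag + 1)
    )
    if long_run_var <= 0:
        raise ValueError("variance estimate is not positive; test is undefined")
    stat = mean_d / math.sqrt(long_run_var / n)
    p_value = 2 * (1 - NormalDist().cdf(abs(stat)))
    return stat, p_value


# Hypothetical toy data: 4 forecast periods, 2 grid cells.
obs = [[0, 0], [1, 0], [0, 0], [0, 1]]
model_a = [[0.1, 0.1], [0.6, 0.1], [0.1, 0.2], [0.2, 0.5]]
model_b = [[0.3, 0.3], [0.3, 0.3], [0.3, 0.3], [0.3, 0.3]]

s_a = period_scores(model_a, obs, poisson_score)
s_b = period_scores(model_b, obs, poisson_score)
stat, p_value = diebold_mariano(s_a, s_b, max_lag=1)
print(f"mean Poisson score A: {sum(s_a) / len(s_a):.3f}, B: {sum(s_b) / len(s_b):.3f}")
print(f"DM statistic: {stat:.2f}, two-sided p-value: {p_value:.4f}")
```

Swapping `poisson_score` for `quadratic_score` in `period_scores` gives the comparison under the quadratic scoring function. With only four periods the normal approximation is not trustworthy; the toy data merely show the mechanics.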
Abstract
Testing earthquake forecasts is essential to obtain scientific information on forecasting models and sufficient credibility for societal use. We aim to enhance the testing phase proposed by the Collaboratory for the Study of Earthquake Predictability (CSEP, Schorlemmer et al., 2018) with new statistical methods supported by mathematical theory. To demonstrate their applicability, we evaluate three short-term forecasting models that were submitted to the CSEP-Italy experiment, and two ensemble models thereof. The models produce weekly overlapping forecasts for the expected number of M4+ earthquakes in a collection of grid cells. We compare the models' forecasts using consistent scoring functions for means or expectations, which are widely used and theoretically principled tools for forecast evaluation. We further discuss and demonstrate their connection to CSEP-style earthquake likelihood model testing, and specifically suggest an improvement of the T-test. Then, using tools from isotonic regression, we investigate forecast reliability and apply score decompositions in terms of calibration and discrimination. Our results show where and how models outperform their competitors and reveal a substantial lack of calibration for various models. The proposed methods also apply to full-distribution (e.g., catalog-based) forecasts, without requiring Poisson distributions or making any other type of parametric assumption.
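The reliability analysis and score decomposition described above can be sketched as follows. This is an illustrative example under stated assumptions rather than the authors' code: mean forecasts are recalibrated by isotonic regression (pool-adjacent-violators), and the mean quadratic score is split into miscalibration (MCB), discrimination (DSC), and uncertainty (UNC) components so that score = MCB - DSC + UNC. The function names and toy data are hypothetical, and the quadratic score is chosen here because recalibrated values of zero would make the Poisson score undefined.

```python
def pav_recalibrate(x, y):
    """Nondecreasing least-squares fit of y on x (pool-adjacent-violators).

    Returns the recalibrated forecast for each entry of x, in input order.
    """
    # Aggregate observations that share the same forecast value.
    groups = {}
    for xi, yi in zip(x, y):
        s, c = groups.get(xi, (0.0, 0))
        groups[xi] = (s + yi, c + 1)

    blocks = []  # each block: [sum of y, count, forecast values in block]
    for xi in sorted(groups):
        s, c = groups[xi]
        blocks.append([s, c, [xi]])
        # Pool neighbouring blocks while their means violate monotonicity.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            s2, c2, v2 = blocks.pop()
            blocks[-1][0] += s2
            blocks[-1][1] += c2
            blocks[-1][2] += v2

    fitted = {}
    for s, c, values in blocks:
        for v in values:
            fitted[v] = s / c
    return [fitted[xi] for xi in x]


def mean_quadratic_score(x, y):
    return sum((a - b) ** 2 for a, b in zip(x, y)) / len(y)


def score_decomposition(x, y):
    """Split the mean quadratic score into MCB, DSC, and UNC components."""
    x_rc = pav_recalibrate(x, y)      # recalibrated forecasts
    y_bar = sum(y) / len(y)           # constant reference forecast
    s = mean_quadratic_score(x, y)
    s_rc = mean_quadratic_score(x_rc, y)
    s_ref = mean_quadratic_score([y_bar] * len(y), y)
    return {"score": s, "MCB": s - s_rc, "DSC": s_ref - s_rc, "UNC": s_ref}


# Hypothetical toy data: forecast expected counts and observed counts.
forecasts = [0.1, 0.2, 0.3, 0.4]
observed = [0, 1, 0, 1]
print(pav_recalibrate(forecasts, observed))
print(score_decomposition(forecasts, observed))
```

Plotting the recalibrated values against the original forecasts gives a reliability curve: departures from the diagonal indicate miscalibration, which the MCB term summarizes as a single number, while a larger DSC term indicates better discrimination.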
