Enhancing the statistical evaluation of earthquake forecasts -- An application to Italy

Jonas R. Brehmer, Kristof Kraus, Tilmann Gneiting, Marcus Herrmann, Warner Marzocchi

TL;DR

The paper addresses evaluating short-term earthquake forecasts by linking CSEP-style likelihood testing with consistent scoring and reliability diagnostics. It develops a nonparametric toolkit based on proper scoring rules, consistent scoring for mean forecasts, and a Diebold–Mariano framework for inference, augmented by CORP mean-calibration and isotonic regression to assess reliability. The findings demonstrate how different Italy-based forecast ensembles perform under Poisson and quadratic losses, reveal calibration deficiencies in several models, and provide diagnostics (mean-reliability, MCB–DSC) that inform model improvement. The proposed approach is distribution-agnostic and applicable to full-distribution forecasts, offering a practical, model-agnostic toolbox for forecast evaluation and development with broad societal relevance.

Abstract

Testing earthquake forecasts is essential to obtain scientific information on forecasting models and sufficient credibility for societal usage. We aim at enhancing the testing phase proposed by the Collaboratory for the Study of Earthquake Predictability (CSEP, Schorlemmer et al., 2018) with new statistical methods supported by mathematical theory. To demonstrate their applicability, we evaluate three short-term forecasting models that were submitted to the CSEP-Italy experiment, and two ensemble models thereof. The models produce weekly overlapping forecasts for the expected number of M4+ earthquakes in a collection of grid cells. We compare the models' forecasts using consistent scoring functions for means or expectations, which are widely used and theoretically principled tools for forecast evaluation. We further discuss and demonstrate their connection to CSEP-style earthquake likelihood model testing, and specifically suggest an improvement of the T-test. Then, using tools from isotonic regression, we investigate forecast reliability and apply score decompositions in terms of calibration and discrimination. Our results show where and how models outperform their competitors and reveal a substantial lack of calibration for various models. The proposed methods also apply to full-distribution (e.g., catalog-based) forecasts, without requiring Poisson distributions or making any other type of parametric assumption.
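As a rough illustration of what "consistent scoring functions for means" implies, the sketch below evaluates a constant expected-count forecast under a Poisson-type and a quadratic scoring function and checks that, on average, both are minimized by the sample mean of the observations. The functional form `x - y * log(x)` for the Poisson score, the toy counts, and all function names are assumptions made for this example, not taken from the paper.

```python
import math

def poisson_score(x, y):
    """Poisson-type score for an expected count x > 0 and observed count y.
    Assumed form: x - y * log(x), i.e. the negative Poisson log-likelihood
    without the term that depends on y only. Smaller is better."""
    return x - y * math.log(x)

def quadratic_score(x, y):
    """Squared error between the expected count x and the observed count y."""
    return (x - y) ** 2

def average_score(score, forecasts, observations):
    """Mean score over paired forecasts and observations."""
    pairs = list(zip(forecasts, observations))
    return sum(score(x, y) for x, y in pairs) / len(pairs)

# Hypothetical toy counts (not data from the paper); their mean is 1.0.
observed = [0, 0, 1, 3]
sample_mean = sum(observed) / len(observed)

# Candidate constant forecasts 0.1, 0.2, ..., 3.0.
candidates = [round(0.1 * k, 1) for k in range(1, 31)]

def best_constant(score):
    """Constant forecast with the lowest average score on the toy data."""
    return min(
        candidates,
        key=lambda x: average_score(score, [x] * len(observed), observed),
    )

best_poisson = best_constant(poisson_score)
best_quadratic = best_constant(quadratic_score)
print(sample_mean, best_poisson, best_quadratic)
```

Both scoring functions single out the same optimal constant forecast here, which is the sense in which either can be used to rank forecasts of an expected number; they differ in how strongly they penalize particular kinds of errors.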

Paper Structure (21 sections, 43 equations, 12 figures, 5 tables, 1 algorithm)


Figures (12)

  • Figure 1: Left: Forecast region of OEF-Italy (orange grid, 8993 grid cells, which corresponds to the testing region of the Italian CSEP experiment) and locations of observed M4+ target earthquakes (crossed circles) between 2005 April 16 and 2020 May 20; Right: Logarithmic bar plot of earthquake magnitudes. For similar displays, see Figure 1 of HM2023, Figure 1 of Spassiani2023, and Figure 2 of Brehmetal2021.
  • Figure 2: Expected number forecasts of the five models. Top: temporal evolution, aggregated over the testing region for each day; Bottom: spatial distribution for the initial seven-day period in OEF-Italy, with forecast values below $10^{-7}$ represented in white.
  • Figure 3: Logarithmic Murphy diagram for the five forecast models. Each curve plots a model's total elementary score $\bar{\mathsf{S}}_\theta$ from \ref{eq:Sbar} versus $\log \theta$. Tick marks at bottom indicate $\log \theta$; tick marks at top show $\theta$. The colored bar at top indicates the model with the lowest value of $\bar{\mathsf{S}}_\theta$. The integral under a model's curve equals the average Poisson score from Table \ref{tab:scores}.
  • Figure 4: From top to bottom: Spatially aggregated daily Poisson score \ref{eq:Sbar_t} for the five forecast models; Daily Poisson score difference relative to the LM model; Cumulative Poisson score difference or information gain (IG) of the LM model over the other models; Information gain per earthquake (IGPE) of the LM model over the other models. In the first two panels, two different markers are used: triangles for periods with one or more M4+ target earthquakes, and circles otherwise. Note the logarithmic scale in the upper two panels. All quantities are negatively oriented, i.e., the smaller the better for the color-coded model. For technical details see Appendix \ref{app:figure}.
  • Figure 5: Histograms of $p$ values for Diebold--Mariano tests of equal predictive ability in terms of the Poisson scoring function for (Left) Mix$_A$ versus Mix$_B$; (Middle) Mix$_A$ versus LM; and (Right) Mix$_B$ versus LM, based on 400 replicates.
  • ...and 7 more figures
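
The Diebold--Mariano tests mentioned in the caption of Figure 5 compare two models through the mean of their score differences. The following is a minimal sketch of such a test, not the paper's implementation: the rectangular-window variance estimate, the normal approximation for the $p$ value, the `max_lag` parameter, and the toy daily scores are all assumptions chosen for illustration.

```python
import math

def diebold_mariano(scores_a, scores_b, max_lag=0):
    """Return (t statistic, two-sided p value) for the null hypothesis of
    equal predictive ability. Scores are negatively oriented, so a negative
    t statistic favors model A.
    Assumptions: rectangular-window long-run variance with lags up to
    max_lag, and a standard normal reference distribution."""
    d = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(d)
    mean_d = sum(d) / n

    def autocov(k):
        # Empirical lag-k autocovariance of the score differences.
        return sum(
            (d[t] - mean_d) * (d[t - k] - mean_d) for t in range(k, n)
        ) / n

    variance = autocov(0) + 2 * sum(autocov(k) for k in range(1, max_lag + 1))
    if variance <= 0:
        raise ValueError("non-positive variance estimate; try another max_lag")

    t_stat = math.sqrt(n) * mean_d / math.sqrt(variance)
    # Two-sided p value under N(0, 1): 2 * (1 - Phi(|t|)) = erfc(|t| / sqrt(2)).
    p_value = math.erfc(abs(t_stat) / math.sqrt(2))
    return t_stat, p_value

# Hypothetical daily scores for two models (not values from the paper).
scores_a = [0.2, 0.1, 0.4, 0.3, 0.2, 0.1]
scores_b = [0.3, 0.3, 0.4, 0.5, 0.2, 0.4]

t_ab, p_ab = diebold_mariano(scores_a, scores_b)
t_ba, p_ba = diebold_mariano(scores_b, scores_a)
print(t_ab, p_ab)
```

Because the forecasts are issued for overlapping periods, consecutive score differences are likely to be correlated, which is why a sketch like this exposes a `max_lag` argument instead of assuming independent differences.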