Statistical Inference for Score Decompositions

Timo Dimitriadis, Marius Puke

Abstract

We introduce inference methods for score decompositions, which partition scoring functions for predictive assessment into three interpretable components: miscalibration, discrimination, and uncertainty. Our estimation and inference rely on a linear recalibration of the forecasts, which is applicable to general multi-step-ahead point forecasts such as means and quantiles due to its validity for both smooth and non-smooth scoring functions. This approach ensures desirable finite-sample properties, enables asymptotic inference, and establishes a direct connection to the classical Mincer-Zarnowitz regression. The resulting inference framework facilitates tests for equal forecast calibration or discrimination, which yield three key advantages. They enhance the information content of predictive ability tests by decomposing scores, deliver higher statistical power in certain scenarios, and formally connect scoring-function-based evaluation to traditional calibration tests, such as financial backtests. Applications demonstrate the method's utility. We find that for survey inflation forecasts, discrimination abilities can differ significantly even when overall predictive ability does not. In an application to financial risk models, our tests provide deeper insights into the calibration and information content of volatility and Value-at-Risk forecasts. By disentangling forecast accuracy from backtest performance, the method exposes critical shortcomings in current banking regulation.
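To make the decomposition described in the abstract concrete, the following is a minimal sketch for mean forecasts under the squared error score. It assumes the common convention that the average score equals MCB - DSC + UNC, where the recalibrated forecast comes from an in-sample Mincer-Zarnowitz regression of the outcome on a constant and the forecast, and the reference forecast is the unconditional sample mean. The function name `score_decomposition` and the example data are hypothetical and not taken from the paper; the paper's estimators and inference procedure may differ in detail.

```python
import numpy as np


def score_decomposition(x, y):
    """Sketch: squared-error score decomposition via linear recalibration.

    x: point forecasts of the mean, y: realized outcomes.
    Assumes score = MCB - DSC + UNC (an assumed convention, see lead-in).
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)

    # Mincer-Zarnowitz regression y = a + b * x + error, fitted by OLS.
    W = np.column_stack([np.ones_like(x), x])
    coef, *_ = np.linalg.lstsq(W, y, rcond=None)
    recalibrated = W @ coef

    score = np.mean((y - x) ** 2)                 # original forecast
    score_recal = np.mean((y - recalibrated) ** 2)  # recalibrated forecast
    score_ref = np.mean((y - y.mean()) ** 2)      # constant reference forecast

    return {
        "score": score,
        "mcb": score - score_recal,    # miscalibration: gain from recalibrating
        "dsc": score_ref - score_recal,  # discrimination: gain over the reference
        "unc": score_ref,              # uncertainty: depends on outcomes only
        "intercept": coef[0],
        "slope": coef[1],
    }


# Hypothetical example: a biased, overly dispersed forecast of a noisy signal.
rng = np.random.default_rng(0)
signal = rng.normal(size=500)
y_demo = signal + rng.normal(size=500)
x_demo = 0.5 + 1.5 * signal
result = score_decomposition(x_demo, y_demo)
print({k: round(float(v), 3) for k, v in result.items()})
```

Because the identity forecast (intercept 0, slope 1) and the constant mean forecast (slope 0) are both special cases of the fitted linear model, the in-sample MCB and DSC components in this sketch cannot be negative, which illustrates the kind of finite-sample property the abstract refers to.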

Paper Structure

This paper contains 26 sections, 9 theorems, 161 equations, 13 figures, and 5 tables.

Key Result

Theorem 3.2

Assume that the recalibrated forecasts are obtained by eqn:Mestimator using the same strictly consistent score $\mathsf{S}$ as used in eqn:FiniteSampleDecomposition. Furthermore, let $\bm W_{it}$ contain a constant and the forecast $X_{it}$, and let $(\widehat{r}_T,0,\dots)^\top \in \boldsymbol{\The

Figures (13)

  • Figure 1: Empirical rejection rates for mean forecasts and the squared error score for the proposed tests of equal miscalibration (in blue) in the upper panel a), equal discrimination (in orange) in the lower panel b), and of the DM test of equal predictive performance (in gray) in the lower plot rows of both panels. We generate the data according to \ref{eqn:SimProcess}--\ref{eqn:SimFcast2} with the twelve parameterizations of Table \ref{tab:main}, which depend on the parameter $k \ge 0$ that is displayed on the $x$-axes. We use $T = 500$ and a nominal level of $10\%$.
  • Figure 2: Empirical rejection rates for $\alpha$-quantile forecasts and the check loss for the proposed tests of equal miscalibration (in blue) in the left two columns, equal discrimination (in orange) in the right two columns, and of the DM test of equal predictive performance (in gray) in all subplots. We generate the data according to \ref{eqn:SimProcess} and \ref{eqn:SimFcastq} with the four parameterizations of Table \ref{tab:SimQuantiles}, which depend on the parameter $k \ge 0$ that is displayed on the $x$-axes. We use $T = 500$ and a nominal level of $10\%$.
  • Figure 3: Quarterly year-over-year CPI inflation rate together with the SPF and Michigan forecasts. For details, see the text in Section \ref{sec:appl_infl}.
  • Figure 4: Inflation forecast evaluation results for the SPF and Michigan forecasts using the squared error scoring function. The left "$\mathsf{MCB}$--$\mathsf{DSC}$" plots show the average score as iso-lines together with its decomposition into miscalibration and discrimination components for the four competing forecasts. The gray boxes in the right panels display the $p$-values of tests for zero miscalibration and discrimination, respectively. The white box displays the average score, miscalibration and discrimination differences, together with 0 to 3 stars indicating significance at the 10%, 5%, and 1% levels.
  • Figure 5: Variance forecast evaluation results for the E-mini futures using a) the QLIKE and b) the squared error scoring functions. The left "$\mathsf{MCB}$--$\mathsf{DSC}$" plots show the average score as iso-lines together with its decomposition into miscalibration and discrimination components for the four competing forecasts. The gray boxes in the right panels display the $p$-values of tests for zero miscalibration and discrimination, respectively. The white boxes display the average score, miscalibration and discrimination differences, together with 0 to 3 stars indicating significance at the 10%, 5%, and 1% levels.
  • ...and 8 more figures

Theorems & Definitions (18)

  • Theorem 3.2
  • Theorem 3.4
  • Theorem 3.6
  • Proposition 3.9
  • Proposition 3.11
  • Proposition 4.1
  • Proof of Theorem \ref{thm:PositivComponenets}
  • Proof of Theorem \ref{thm:joint_nomal_mcb_dsc}
  • Proof of Theorem \ref{thm:MCBandDSC0}
  • Proof of Proposition \ref{prop:verification_phi}
  • ...and 8 more