Scoring Rules and Calibration for Imprecise Probabilities

Christian Fröhlich, Robert C. Williamson

TL;DR

It is argued that proper scoring rules and calibration serve two distinct goals, which are aligned in the precise case, but intriguingly are not necessarily aligned in the imprecise case.

Abstract

What does it mean to say that, for example, the probability for rain tomorrow is between 20% and 30%? The theory for the evaluation of precise probabilistic forecasts is well-developed and is grounded in the key concepts of proper scoring rules and calibration. For the case of imprecise probabilistic forecasts (sets of probabilities), such theory is still lacking. In this work, we therefore generalize proper scoring rules and calibration to the imprecise case. We develop these concepts as relative to data models and decision problems. As a consequence, the imprecision is embedded in a clear context. We establish a close link to the paradigm of (group) distributional robustness and in doing so provide new insights for it. We argue that proper scoring rules and calibration serve two distinct goals, which are aligned in the precise case, but intriguingly are not necessarily aligned in the imprecise case. The concept of decision-theoretic entropy plays a key role for both goals. Finally, we demonstrate the theoretical insights in machine learning practice; in particular, we illustrate subtle pitfalls relating to the choice of loss function in distributional robustness.
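To make the rain example concrete, the following minimal sketch contrasts a precise forecast with an interval forecast. It rests on assumptions that are not taken from the paper: the Brier score stands in for a generic proper scoring rule, the imprecise forecast is read as the interval [0.2, 0.3] of probabilities for rain, and that interval is evaluated through its worst-case (upper) expected score, which is one common distributionally robust reading. The function names are hypothetical.

```python
import numpy as np

def brier(q, y):
    """Brier score of forecast q for a binary outcome y (lower is better)."""
    return (q - y) ** 2

def expected_score(p, q):
    """Expected Brier score of forecast q when rain has probability p."""
    return p * brier(q, 1) + (1 - p) * brier(q, 0)

def upper_expected_score(p_low, p_high, q):
    """Worst-case expected score over the interval [p_low, p_high].

    The expectation is linear in p, so the maximum sits at an endpoint.
    """
    return max(expected_score(p_low, q), expected_score(p_high, q))

grid = np.linspace(0.0, 1.0, 1001)  # candidate forecasts q

# Precise case: propriety means the true probability minimizes the expected score.
p_true = 0.25
best_precise = grid[np.argmin([expected_score(p_true, q) for q in grid])]

# Imprecise case (assumed reading): minimize the worst case over [0.2, 0.3].
best_robust = grid[np.argmin([upper_expected_score(0.2, 0.3, q) for q in grid])]

print(best_precise, best_robust)
```

Under these assumptions the robust minimizer is the element of the interval closest to 0.5, that is, the one with the largest Brier entropy p(1 - p). This is a toy illustration of why entropy enters the picture, not a reproduction of the paper's definitions.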

Paper Structure

This paper contains 41 sections, 17 theorems, 104 equations, 5 figures, 3 tables.

Key Result

Proposition 2.6

Let $\ell \in \mathcal{L}$ and $\mathcal{P},\mathcal{Q} \in \mathcal{IP}$. Then it holds that

Figures (5)

  • Figure 1: Evaluation of IP calibration under the train (left column) and test (right column) data models on ACS PUMS. In each subplot, the forecasts $\mathcal{Q}$ vary along the $x$-axis and the loss function used for evaluation corresponds to a hue. The top row shows diagnostic values for IP calibration without groups, that is, the $y$-axis shows $\overline{R}_{\hat{\mathcal{P}}}\left(\tilde{S}_\ell(\omega) - \overline{R}_{\mathcal{Q}(\omega)}\left( \tilde{S}_\ell \right) \right)$ (top left) and $\overline{R}_{\hat{\mathcal{P}}_\text{test}}\left(\tilde{S}_\ell(\omega) - \overline{R}_{\mathcal{Q}(\omega)}\left( \tilde{S}_\ell \right) \right)$ (top right). In the other subplots, the $y$-axis shows the diagnostics for IP calibration with respect to the action-induced partition (of Equation \ref{eq:deccaldiagnostic}) under $\hat{P}$ (left) and $\hat{P}_{\text{test}}$ (right), respectively. We observe that the $\textbf{GBR}$ forecast, in contrast to the other forecasts, is sub-calibrated in almost all cases (recall that we consider negative values on the $y$-axis as more desirable than positive values).
  • Figure 2: Left: a visualization of the asymmetric loss function (the two curves are the partial loss functions, meaning each curve has a fixed outcome $y \in \{0,1\}$). Middle: the discrete step function and the sigmoid approximation. The difference is hardly discernible. Right: by zooming in near $c=0.1$, the difference between the original loss function and its approximation can be detected.
  • Figure 3: Comparison of unconditional entropies $\hat{H}_{\ell_c}$ and $\hat{H}_{\ell_{W;c}}$ (rescaled to obtain maximum value of $1$) of $\ell_c$ and $\ell_{W;c}$ with $c=0.1$. The x-axis corresponds to the set of probabilities $\Delta^2$ on the binary $Y$.
  • Figure 4: Evaluation of IP calibration under the train (left column) and test (right column) data models on Framingham. In each subplot, the forecasts $\mathcal{Q}$ vary along the $x$-axis and the loss function used for evaluation corresponds to a hue. The top row shows diagnostic values for IP calibration without groups, that is, the $y$-axis shows $\overline{R}_{\hat{\mathcal{P}}}\left(\tilde{S}_\ell(\omega) - \overline{R}_{\mathcal{Q}(\omega)}\left( \tilde{S}_\ell \right) \right)$ (top left) and $\overline{R}_{\hat{\mathcal{P}}_\text{test}}\left(\tilde{S}_\ell(\omega) - \overline{R}_{\mathcal{Q}(\omega)}\left( \tilde{S}_\ell \right) \right)$ (top right). In the other subplots, the $y$-axis shows the diagnostics for IP calibration with respect to the action-induced partition (of Equation \ref{eq:deccaldiagnostic}) under $\hat{P}$ (left) and $\hat{P}_{\text{test}}$ (right), respectively.
  • Figure 5: Evaluation of IP calibration under the train (left column) and test (right column) data models on CelebA. In each subplot, the forecasts $\mathcal{Q}$ vary along the $x$-axis and the loss function used for evaluation corresponds to a hue. The top row shows diagnostic values for IP calibration without groups, that is, the $y$-axis shows $\overline{R}_{\hat{\mathcal{P}}}\left(\tilde{S}_\ell(\omega) - \overline{R}_{\mathcal{Q}(\omega)}\left( \tilde{S}_\ell \right) \right)$ (top left) and $\overline{R}_{\hat{\mathcal{P}}_\text{test}}\left(\tilde{S}_\ell(\omega) - \overline{R}_{\mathcal{Q}(\omega)}\left( \tilde{S}_\ell \right) \right)$ (top right). In the other subplots, the $y$-axis shows the diagnostics for IP calibration with respect to the action-induced partition (of Equation \ref{eq:deccaldiagnostic}) under $\hat{P}$ (left) and $\hat{P}_{\text{test}}$ (right), respectively.

Theorems & Definitions (42)

  • Example 2.1
  • Definition 2.3
  • Proposition 2.6: Weak propriety
  • Proposition 2.7: Partial failure of strict propriety
  • Proposition 2.8: "Strong" propriety
  • Definition 2.9
  • Proposition 2.10
  • Remark 2.11: Connection to standard setup and intuition for uniqueness condition
  • Definition 2.13: Walley (1991)
  • Definition 2.14
  • ...and 32 more