Evaluating Posterior Probabilities: Decision Theory, Proper Scoring Rules, and Calibration

Luciana Ferrer; Daniel Ramos

TL;DR

This paper argues that evaluating probabilistic classifiers should rely on expected proper scoring rules (EPSRs) rather than calibration metrics like the expected calibration error (ECE), since calibration alone does not capture the usefulness of posteriors for decision-making. It formalizes EPSRs as costs of Bayes decisions and demonstrates how they decompose into calibration and discrimination components, enabling a calibration-aware yet task-relevant evaluation. The authors introduce Calibration Loss (CalLoss) as a practical, interpretable diagnostic that measures the potential gains from post-hoc calibration, and compare it to ECE and classic score-divergence decompositions, showing CalLoss to be more reliable and actionable. Through synthetic and real-data experiments, they illustrate that EPSRs better reflect posterior quality and that calibration metrics can mislead when used for system comparison; calibration should be used for diagnostics and for guiding calibration efforts, not as a sole predictor of practical performance. The work culminates in guidance for practitioners and an open-source repository to compute these metrics for broader adoption in evaluating probabilistic classifiers.
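As a rough illustration of the calibration loss idea, the sketch below computes the cross-entropy of a set of posteriors before and after a post-hoc calibration step and reports the difference. Several things here are assumptions made for illustration and are not taken from the paper: the calibrator is a plain temperature scaling fitted by grid search on the evaluation data itself, the relative version is taken to be the loss expressed as a percentage of the uncalibrated cross-entropy, and all function names and the toy data are hypothetical.

```python
import numpy as np

def cross_entropy(log_post, labels):
    """Mean negative log posterior of the true class (natural log)."""
    return float(-np.mean(log_post[np.arange(len(labels)), labels]))

def log_softmax(scores):
    """Row-wise log-softmax, shifted by the row maximum for numerical stability."""
    shifted = scores - scores.max(axis=1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

def calibration_loss(post, labels, temps=np.logspace(-2, 2, 401)):
    """Return (CalLoss, relative CalLoss in percent) for cross-entropy.

    Assumption: the calibrator is temperature scaling of the log posteriors,
    with the temperature chosen by grid search on the same data.
    """
    log_post = np.log(np.asarray(post, dtype=float))
    labels = np.asarray(labels)
    raw = cross_entropy(log_post, labels)
    best = min(cross_entropy(log_softmax(log_post / t), labels) for t in temps)
    # The identity transform (t = 1) belongs to the family, so the
    # calibrated value can never be worse than the raw one.
    cal = min(raw, best)
    return raw - cal, 100.0 * (raw - cal) / raw

# Hypothetical toy data: a binary task where the decisions are right 80% of the time.
labels = np.array([0] * 500 + [1] * 500)
correct = np.tile(np.array([True] * 4 + [False]), 200)

def make_post(conf):
    """Posteriors that give `conf` to the decided class on every sample."""
    p_true = np.where(correct, conf, 1.0 - conf)
    post = np.empty((len(labels), 2))
    post[np.arange(len(labels)), labels] = p_true
    post[np.arange(len(labels)), 1 - labels] = 1.0 - p_true
    return post

overconfident = make_post(0.99)   # claims 99% confidence, is right 80% of the time
well_matched = make_post(0.80)    # claims 80% confidence, is right 80% of the time

loss_over, rel_over = calibration_loss(overconfident, labels)
loss_match, rel_match = calibration_loss(well_matched, labels)
print(f"overconfident: CalLoss={loss_over:.3f}, relative={rel_over:.1f}%")
print(f"well matched:  CalLoss={loss_match:.3f}, relative={rel_match:.1f}%")
```

In this toy case the overconfident posteriors show a large loss, meaning a calibration stage would pay off, while the well-matched ones show a loss near zero. Fitting the calibrator on the evaluation data tends to give an optimistic estimate; a cross-validated fit would be the more careful choice.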

Abstract

Most machine learning classifiers are designed to output posterior probabilities for the classes given the input sample. These probabilities may be used to make the categorical decision on the class of the sample; provided as input to a downstream system; or provided to a human for interpretation. Evaluating the quality of the posteriors generated by these systems is an essential problem which was addressed decades ago with the invention of proper scoring rules (PSRs). Unfortunately, much of the recent machine learning literature uses calibration metrics -- most commonly, the expected calibration error (ECE) -- as a proxy to assess posterior performance. The problem with this approach is that calibration metrics reflect only one aspect of the quality of the posteriors, ignoring the discrimination performance. For this reason, we argue that calibration metrics should play no role in the assessment of posterior quality. Expected PSRs should instead be used for this job, preferably normalized for ease of interpretation. In this work, we first give a brief review of PSRs from a practical perspective, motivating their definition using Bayes decision theory. We discuss why expected PSRs provide a principled measure of the quality of a system's posteriors and why calibration metrics are not the right tool for this job. We argue that calibration metrics, while not useful for performance assessment, may be used as diagnostic tools during system development. With this purpose in mind, we discuss a simple and practical calibration metric, called calibration loss, derived from a decomposition of expected PSRs. We compare this metric with the ECE and with the expected score divergence calibration metric from the PSR literature and argue, using theoretical and empirical evidence, that calibration loss is superior to these two metrics.
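The point that a calibration metric ignores discrimination can be seen in a small numerical sketch. The code below computes cross-entropy and Brier score, normalizes each by its value for a naive system that always outputs the class priors, and compares them with a top-label ECE. The helper names, the toy data, the ECE binning (10 equal-width bins), and the choice to estimate the priors from the evaluation labels are all assumptions made for this illustration, not details taken from the paper. The Brier score is written without any constant scaling factor, since such a factor cancels in the normalized version.

```python
import numpy as np

def cross_entropy(post, labels):
    """Mean negative log posterior of the true class (natural log)."""
    post = np.asarray(post, dtype=float)
    return float(-np.mean(np.log(post[np.arange(len(labels)), labels])))

def brier_score(post, labels):
    """Mean squared distance between posterior vectors and one-hot labels."""
    post = np.asarray(post, dtype=float)
    onehot = np.eye(post.shape[1])[labels]
    return float(np.mean(np.sum((post - onehot) ** 2, axis=1)))

def naive_posteriors(labels, n_classes):
    """Posteriors of a system that ignores the input and outputs the priors."""
    priors = np.bincount(labels, minlength=n_classes) / len(labels)
    return np.tile(priors, (len(labels), 1))

def normalized(metric, post, labels):
    """Metric divided by its value for the naive prior-only system."""
    naive = naive_posteriors(labels, np.asarray(post).shape[1])
    return metric(post, labels) / metric(naive, labels)

def ece(post, labels, n_bins=10):
    """Top-label ECE with equal-width confidence bins."""
    post = np.asarray(post, dtype=float)
    conf = post.max(axis=1)
    hit = (post.argmax(axis=1) == labels).astype(float)
    bins = np.minimum((conf * n_bins).astype(int), n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = bins == b
        if mask.any():
            total += mask.mean() * abs(hit[mask].mean() - conf[mask].mean())
    return float(total)

# Hypothetical toy data: 70% of samples in class 0, 30% in class 1.
labels = np.array([0] * 700 + [1] * 300)

# System A: always outputs the priors, so it carries no information about the input.
naive = naive_posteriors(labels, 2)

# System B: always gives 0.8 to the true class (every decision is right,
# but the stated confidence is too low).
sharp = np.full((len(labels), 2), 0.2)
sharp[np.arange(len(labels)), labels] = 0.8

for name, post in [("naive", naive), ("sharp", sharp)]:
    print(name,
          f"ECE={ece(post, labels):.3f}",
          f"NCE={normalized(cross_entropy, post, labels):.3f}",
          f"NBS={normalized(brier_score, post, labels):.3f}")
```

In this toy case the naive system gets an ECE of essentially 0 while the informative system gets about 0.2, so ranking by ECE would prefer the system that says nothing about the input. The normalized cross-entropy tells the opposite story: 1.0 for the naive system and about 0.37 for the informative one.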

Paper Structure (28 sections, 36 equations, 5 figures, 3 tables)

Introduction
From Bayes decision theory to calibration
The reference distribution
Bayes decision theory
Proper scoring rules
Bayes risk
Cross-entropy
Brier score
EPSRs as integrals over Bayes risks
Calibration
Calibration and Bayes decisions
Calibration transformations
Calibration metrics
Expected score divergence
ECE: Expected calibration error
...and 13 more sections

Figures (5)

Figure 1: Weighted Bayes risk curves for four different systems. Left: the weight is 1.0 so that the integral under these curves is the CE. Right: the weight is $a_1\ (1-a_1)/\beta(2,2)$ so that the integral is the BS. The normalized CE, NCE, and normalized BS, NBS, are shown in the legends.
Figure 2: Various metrics for a binary classification task for four different systems, one perfectly calibrated (cal) and three miscalibrated ones (mcs, mcp, mcps). Left: normalized overall performance metrics. The dashed line indicates the performance of a naive system. Right: calibration metrics including the binary ECE, the multiclass ECE (ECEmc), and the RCL based on BS for two calibration approaches, DP (D), and histogram binning (H), trained either through cross-validation on the test data (xv) or training on the full test dataset (tt).
Figure 3: Overall and calibration metrics as in Figure 2 but for a 10-class classification task.
Figure 4: Expected divergence between synthetic posteriors for a binary task and $\mathbf{q}_0$, a fixed vector of posteriors. The figure shows the expected divergence as a function of the first component of $\mathbf{q}_0$, which we call $q_{01}$, for different divergences (blue curves) and the mean value of the first component of the posterior (star). For valid score divergences, the star has to coincide with the minimum of the curve. Since it does not for the L1 loss, this proves by contradiction that the L1 loss is not a score divergence.
Figure 5: Metrics with confidence intervals obtained with bootstrapping. Left: the normalized BS for the same posteriors as in Figure 2. Right: the RCL for BS using the same calibration methods as in Figure 2.
