Evaluating Posterior Probabilities: Decision Theory, Proper Scoring Rules, and Calibration
Luciana Ferrer, Daniel Ramos
TL;DR
This paper argues that evaluating probabilistic classifiers should rely on expected proper scoring rules (EPSRs) rather than calibration metrics like the expected calibration error (ECE), since calibration alone does not capture the usefulness of posteriors for decision-making. It formalizes EPSRs as costs of Bayes decisions and demonstrates how they decompose into calibration and discrimination components, enabling a calibration-aware yet task-relevant evaluation. The authors introduce Calibration Loss (CalLoss) as a practical, interpretable diagnostic that measures the potential gains from post-hoc calibration, and compare it to ECE and classic score-divergence decompositions, showing CalLoss to be more reliable and actionable. Through synthetic and real-data experiments, they illustrate that EPSRs better reflect posterior quality and that calibration metrics can mislead when used for system comparison; calibration should be used for diagnostics and for guiding calibration efforts, not as a sole predictor of practical performance. The work culminates in guidance for practitioners and an open-source repository to compute these metrics for broader adoption in evaluating probabilistic classifiers.
Abstract
Most machine learning classifiers are designed to output posterior probabilities for the classes given the input sample. These probabilities may be used to make the categorical decision on the class of the sample; provided as input to a downstream system; or provided to a human for interpretation. Evaluating the quality of the posteriors generated by these systems is an essential problem that was addressed decades ago with the invention of proper scoring rules (PSRs). Unfortunately, much of the recent machine learning literature uses calibration metrics -- most commonly, the expected calibration error (ECE) -- as a proxy to assess posterior performance. The problem with this approach is that calibration metrics reflect only one aspect of the quality of the posteriors, ignoring the discrimination performance. For this reason, we argue that calibration metrics should play no role in the assessment of posterior quality. Expected PSRs should instead be used for this job, preferably normalized for ease of interpretation. In this work, we first give a brief review of PSRs from a practical perspective, motivating their definition using Bayes decision theory. We discuss why expected PSRs provide a principled measure of the quality of a system's posteriors and why calibration metrics are not the right tool for this job. We argue that calibration metrics, while not useful for performance assessment, may be used as diagnostic tools during system development. With this purpose in mind, we discuss a simple and practical calibration metric, called calibration loss, derived from a decomposition of expected PSRs. We compare this metric with the ECE and with the expected score divergence calibration metric from the PSR literature and argue, using theoretical and empirical evidence, that calibration loss is superior to these two metrics.
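As a rough illustration of the quantities named in the abstract, the Python sketch below computes one expected PSR (the logarithmic rule, i.e. cross-entropy), a normalized version, and a calibration loss taken as the reduction in that expected PSR after post-hoc calibration. Several choices here are assumptions not stated in the abstract: normalization divides by the cross-entropy of a naive system that always outputs the empirical class priors, the calibration transform is temperature scaling, and the temperature is fit by grid search on the evaluation data itself. The function names are hypothetical and are not the API of the authors' repository.

```python
import numpy as np

def cross_entropy(post, labels):
    """Expected logarithmic scoring rule: mean of -log posterior of the true class."""
    post = np.clip(post, 1e-12, 1.0)
    return float(-np.mean(np.log(post[np.arange(len(labels)), labels])))

def normalized_cross_entropy(post, labels):
    """Cross-entropy divided by that of a naive system that outputs the class priors.

    Assumption: priors are estimated from the evaluation labels. Values below 1
    mean the system is better than the naive baseline.
    """
    priors = np.bincount(labels, minlength=post.shape[1]) / len(labels)
    naive = np.tile(priors, (len(labels), 1))
    return cross_entropy(post, labels) / cross_entropy(naive, labels)

def temperature_scale(post, t):
    """Rescale log-posteriors by 1/t and renormalize (assumed calibration transform)."""
    logp = np.log(np.clip(post, 1e-12, 1.0)) / t
    logp -= logp.max(axis=1, keepdims=True)  # for numerical stability
    p = np.exp(logp)
    return p / p.sum(axis=1, keepdims=True)

def calibration_loss(post, labels, temps=np.logspace(-1, 1, 201)):
    """Reduction in cross-entropy obtained by the best temperature on the grid.

    The raw value is included among the candidates (identity transform),
    so the result is never negative.
    """
    raw = cross_entropy(post, labels)
    best = min(raw, min(cross_entropy(temperature_scale(post, t), labels) for t in temps))
    return raw - best

# Toy two-class example (made-up numbers, for illustration only).
post = np.array([[0.9, 0.1], [0.8, 0.2], [0.3, 0.7], [0.6, 0.4]])
labels = np.array([0, 0, 1, 1])
sharpened = temperature_scale(post, 0.2)  # same ranking, more extreme posteriors

print("normalized cross-entropy:", normalized_cross_entropy(post, labels))
print("calibration loss (original):", calibration_loss(post, labels))
print("calibration loss (sharpened):", calibration_loss(sharpened, labels))
```

In this toy setup the sharpened posteriors discriminate between the classes exactly as well as the original ones, so the difference in their cross-entropy comes from calibration alone. Fitting the transform on the evaluation data gives an optimistic estimate of the achievable gain; a held-out or cross-validated fit is the more conservative alternative.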
