An Entropic Metric for Measuring Calibration of Machine Learning Models

Daniel James Sumler, Lee Devlin, Simon Maskell, Richard O. Lane

TL;DR

The paper addresses the risk-sensitivity of probabilistic ML predictions by introducing the Entropic Calibration Difference (ECD), an entropy-based calibration metric inspired by the Normalised Estimation Error Squared (NEES) from target tracking. ECD distinguishes under- from over-confidence and, unlike ECE, emphasises safe calibration by penalising over-confidence more heavily; it also extends to discrete, binary settings. The authors derive the Gaussian connection between ECD and NEES, provide a discrete ECD formulation, and interpret single-datum scores. Through simulated and real-data experiments, they show that ECD can reveal unsafe calibration that existing metrics do not capture, offering a practical tool for safer decision-making in ML systems.
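For readers unfamiliar with NEES, the following is a minimal sketch of the standard target-tracking quantity $e^\top P^{-1} e$, where $e$ is the estimation error and $P$ the reported covariance. It is not the paper's ECD. The function names `nees` and `mean_nees`, the covariance `P_true`, and the scaling factors are illustrative assumptions. For a consistent Gaussian estimator the mean NEES is close to the state dimension; a reported covariance that is too small (over-confidence) pushes it above, and one that is too large (under-confidence) pushes it below.

```python
import numpy as np


def nees(x_true, x_est, P):
    """Normalised Estimation Error Squared for one estimate: e^T P^{-1} e."""
    e = np.asarray(x_true, dtype=float) - np.asarray(x_est, dtype=float)
    # Solve P z = e instead of forming the explicit inverse of P.
    return float(e @ np.linalg.solve(np.asarray(P, dtype=float), e))


def mean_nees(scale, n=20000, seed=0):
    """Average NEES when the reported covariance is scale * P_true.

    P_true is a hypothetical 2-D error covariance chosen for illustration.
    """
    rng = np.random.default_rng(seed)
    P_true = np.array([[2.0, 0.3], [0.3, 1.0]])
    # Errors of an unbiased estimator whose true error covariance is P_true.
    errors = rng.multivariate_normal(np.zeros(2), P_true, size=n)
    P_reported = scale * P_true
    return float(np.mean([nees(e, np.zeros(2), P_reported) for e in errors]))


if __name__ == "__main__":
    print("consistent      :", mean_nees(1.0))   # near the state dimension, 2
    print("over-confident  :", mean_nees(0.25))  # reported covariance too small
    print("under-confident :", mean_nees(4.0))   # reported covariance too large
```

Because the mean NEES moves in opposite directions for the two failure modes, it separates over- from under-confidence, which is the property the summary says ECD carries over to classification.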

Abstract

Understanding the confidence with which a machine learning model classifies an input datum is an important, and perhaps under-investigated, concept. In this paper, we propose a new calibration metric, the Entropic Calibration Difference (ECD). Based on existing research in the field of state estimation, specifically target tracking (TT), we show how ECD may be applied to binary classification machine learning models. We describe the relative importance of under- and over-confidence and how they are not conflated in the TT literature. Indeed, our metric distinguishes under- from over-confidence. We consider this important given that algorithms that are under-confident are likely to be 'safer' than algorithms that are over-confident, albeit at the expense of also being over-cautious and so statistically inefficient. We demonstrate how this new metric performs on real and simulated data and compare with other metrics for machine learning model probability calibration, including the Expected Calibration Error (ECE) and its signed counterpart, the Expected Signed Calibration Error (ESCE).
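As a point of reference for the comparison metrics named in the abstract, here is a minimal sketch of a binned ECE and a signed variant for binary classification. It assumes equal-width bins, takes `probs` to be the predicted probability of the positive class, and uses the sign convention "mean predicted probability minus observed positive ratio"; the paper's exact ESCE definition and sign convention may differ. The function name `binned_calibration` and the parameter `n_bins` are illustrative, not taken from the paper.

```python
import numpy as np


def binned_calibration(probs, labels, n_bins=10):
    """Return (ece, signed_ce) for binary labels using equal-width bins.

    Assumed sign convention: positive values mean the predicted
    probabilities exceed the observed positive ratio within a bin.
    """
    probs = np.asarray(probs, dtype=float)
    labels = np.asarray(labels, dtype=float)
    # Map each probability to a bin index; p == 1.0 goes in the last bin.
    idx = np.minimum((probs * n_bins).astype(int), n_bins - 1)
    ece = 0.0
    signed_ce = 0.0
    for b in range(n_bins):
        mask = idx == b
        if not mask.any():
            continue  # empty bins contribute nothing
        weight = mask.mean()  # fraction of all data in this bin
        gap = probs[mask].mean() - labels[mask].mean()
        ece += weight * abs(gap)
        signed_ce += weight * gap
    return ece, signed_ce


if __name__ == "__main__":
    # One bin predicts too high, another too low, by the same amount.
    probs = [0.9, 0.9, 0.9, 0.9, 0.6, 0.6, 0.6, 0.6]
    labels = [1, 1, 0, 0, 1, 1, 1, 1]
    print(binned_calibration(probs, labels))
```

In the usage example the unsigned score reports a gap while the signed score cancels to roughly zero, since the two bins err in opposite directions. This shows why a single aggregated signed number can hide miscalibration.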

Paper Structure

This paper contains 22 sections, 19 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: A reliability diagram for a well-calibrated model. The blue line represents $\text{ratio} = \text{probability}$; the red dots represent bins.
  • Figure 2: Range of ECD scores for probabilities between 0 and 1.
  • Figure 3: Histograms of simulated data after different levels of noise $\epsilon$ are added.
  • Figure 4: Reliability diagrams of simulated data with different levels of noise $\epsilon$ added. The blue line represents $y = x$; the red dots are numbered bins.
  • Figure 5: Reliability diagrams of (a) BERTimbau, (b) ResNet18, and (c) RoBERTa.