Calibration through the Lens of Interpretability

Alireza Torabian, Ruth Urner

TL;DR

This work develops an axiomatic framework for calibration that separates calibration from accuracy and interpretability. It formalizes five desiderata (calibration, accuracy, approximating the regression function, interpretability via small, identifiable cells, and monotonicity with respect to the data-generating regression function) and analyzes their mutual relationships. It introduces relaxed, population-level metrics (CE_p,D, RMSE, PC, KT) and analyzes two interpretability-preserving operations (cell merging and average label assignment), deriving their theoretical effects on calibration and related measures. In an empirical study on 36 real datasets, it compares interpretable decision trees to standard calibration methods (Platt scaling, isotonic regression, and PCT), showing that decision trees can offer competitive calibration while providing interpretable outputs, with PDE emerging as a favorable calibration metric. The paper argues for incorporating interpretability as a core criterion in calibration to ensure meaningful confidence scores for end users.

Abstract

Calibration is a frequently invoked concept when useful label probability estimates are required on top of classification accuracy. A calibrated model is a function whose values correctly reflect underlying label probabilities. Calibration in itself, however, implies neither classification accuracy nor human-interpretable estimates, and it is not straightforward to verify calibration from finite data. There is a plethora of evaluation metrics (and loss functions) that each assess a specific aspect of a calibration model. In this work, we initiate an axiomatic study of the notion of calibration. We catalogue desirable properties of calibrated models as well as corresponding evaluation metrics and analyze their feasibility and correspondences. We complement this analysis with an empirical evaluation, comparing common calibration methods to employing a simple, interpretable decision tree.

Paper Structure

This paper contains 22 sections, 8 theorems, 43 equations, 11 figures, 3 tables.

Key Result

Theorem 2

There exist predictors $f$ different from $\eta_D$ (with positive probability) satisfying both perfect calibration and optimal classification accuracy if and only if one of the sets $(\mathrm{range}_{D}(\eta_D) \cap [0, 0.5))$ and $(\mathrm{range}_{D}(\eta_D) \cap [0.5,1])$ has size at least $2$ (th
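
A small numerical sketch may help make the statement concrete. It is not the authors' construction: the two-cell distributions, the function names, and the convention of predicting label 1 when the score is at least 0.5 (chosen to match the split into $[0, 0.5)$ and $[0.5, 1]$) are illustrative assumptions. In the first case both values of $\eta_D$ lie on the same side of 0.5, so replacing them by their average gives a calibrated predictor that differs from $\eta_D$ and keeps the same classification error. In the second case the values lie on opposite sides, and the averaged predictor stays calibrated but loses accuracy.

```python
from collections import defaultdict
from math import isclose


def is_calibrated(weights, eta, scores, tol=1e-9):
    """Check E[Y | f(X) = s] == s for every score s that f outputs."""
    mass = defaultdict(float)       # probability mass of each score level
    positives = defaultdict(float)  # mass of label 1 within each score level
    for w, e, s in zip(weights, eta, scores):
        mass[s] += w
        positives[s] += w * e
    return all(isclose(positives[s] / mass[s], s, abs_tol=tol) for s in mass)


def classification_error(weights, eta, scores):
    """Expected 0/1 error when predicting label 1 iff the score is >= 0.5."""
    return sum(w * ((1 - e) if s >= 0.5 else e)
               for w, e, s in zip(weights, eta, scores))


# Hypothetical two-cell distribution with equal cell weights.
weights = [0.5, 0.5]

# Both values of eta lie in [0.5, 1]: averaging keeps the thresholded label.
eta_same_side = [0.6, 0.8]
f_same_side = [0.7, 0.7]

# Values of eta on opposite sides of 0.5: averaging flips one cell's label.
eta_split = [0.4, 0.8]
f_split = [0.6, 0.6]

for name, eta, f in [("same side", eta_same_side, f_same_side),
                     ("split", eta_split, f_split)]:
    print(name,
          "calibrated:", is_calibrated(weights, eta, f),
          "error(f):", round(classification_error(weights, eta, f), 3),
          "error(eta):", round(classification_error(weights, eta, eta), 3))
```

Scores are used as dictionary keys here, which is adequate for a hand-built example with exactly repeated values; real predictor outputs would need binning or a tolerance-based grouping.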

Figures (11)

  • Figure 1: Interplay of calibration desiderata. The intersection of strictly monotonic and perfectly calibrated predictors only contains the regression function $\eta_D$ (and functions that agree with $\eta_D$ with probability $1$ over $D_X$).
  • Figure 2: Probabilistic count example on $n$ cells with the same weight; in this case ${\mathrm{PC}_{D}}(f) =n$ is the number of cells.
  • Figure 3: Probabilistic count example on 9 cells including 5 small cells. The cells with small weights have little effect on the probabilistic count: although there are 9 cells, ${\mathrm{PC}_{D}}(f)$ is close to $4$, which shows that PC emphasizes the number of significant cells. ${\mathrm{PC}_{D}}(f)=\frac{1}{\left(\frac{1}{4}\right)^2\cdot 3 + \left(\frac{19}{80}\right)^2 + \left(\frac{1}{80}\right)^2 \cdot 5} \approx 4.08$.
  • Figure 4: Probabilistic count example on three cells with different weights; ${\mathrm{PC}_{D}}(f)=\frac{1}{\left(\frac{1}{4}\right)^2\cdot 2 + \left(\frac{1}{2}\right)^2} \approx 2.66$.
  • Figure 5: Probabilistic count example on 12 cells including 10 cells distributed on a third. For three equally weighted cells the probabilistic count is 3; here the third cell is split into ten small cells of equal weight. ${\mathrm{PC}_{D}}(f)=\frac{1}{\left(\frac{1}{3}\right)^2\cdot 2 + \left(\frac{1}{30}\right)^2\cdot 10} \approx 4.28$.
  • ...and 6 more figures
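
The arithmetic in the captions above suggests that the probabilistic count is the reciprocal of the sum of squared cell weights. The sketch below reproduces those numbers under that reading; the formula is inferred from the captions rather than quoted from the paper's definition, and the function name `probabilistic_count` is a hypothetical choice. Exact fractions are used so that the results can be compared with the captions without floating-point noise.

```python
from fractions import Fraction as F


def probabilistic_count(weights):
    """Reciprocal of the sum of squared cell weights (formula inferred from the captions)."""
    return 1 / sum(w * w for w in weights)


# Figure 2: n equally weighted cells give a count of exactly n (here n = 6).
n = 6
pc_equal = probabilistic_count([F(1, n)] * n)

# Figure 3: weights taken as printed in the caption (three of 1/4, one of 19/80, five of 1/80).
pc_fig3 = probabilistic_count([F(1, 4)] * 3 + [F(19, 80)] + [F(1, 80)] * 5)

# Figure 4: three cells with weights 1/4, 1/4, 1/2.
pc_fig4 = probabilistic_count([F(1, 4), F(1, 4), F(1, 2)])

# Figure 5: two cells of weight 1/3 and ten cells of weight 1/30.
pc_fig5 = probabilistic_count([F(1, 3)] * 2 + [F(1, 30)] * 10)

for label, value in [("equal", pc_equal), ("fig3", pc_fig3),
                     ("fig4", pc_fig4), ("fig5", pc_fig5)]:
    print(label, value, float(value))
```

Under this reading, splitting a heavy cell into many light ones raises the count only modestly, which matches the captions' point that PC tracks the number of significant cells rather than the raw number of cells.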

Theorems & Definitions (21)

  • Definition 2.1
  • proof
  • Theorem 2
  • proof
  • Corollary 1
  • proof
  • Theorem 4
  • proof
  • Corollary 2
  • Definition 4.1: Cell merge with score averaging
  • ...and 11 more