From Uncertainty to Precision: Enhancing Binary Classifier Performance through Calibration

Agathe Fernandes Machado, Arthur Charpentier, Emmanuel Flachaire, Ewen Gallic, François Hu

TL;DR

The paper tackles the misalignment between discriminative performance and probabilistic calibration in binary classification, arguing that well-calibrated scores are essential for decision-making in finance and healthcare. It introduces the Local Calibration Score (LCS), a calibration metric based on a smooth calibration curve derived from local regression, and compares it to traditional quantile-based metrics like ECE. Through synthetic data with known probabilities and a real-world credit-default dataset, the authors show LCS more accurately tracks true miscalibration and that recalibration methods (Platt, isotonic, Beta, and local regression) improve calibration, sometimes at a small cost to discriminative ability (AUC). They further demonstrate that optimizing solely for AUC can degrade calibration, and that using a regression-based Random Forest yields better calibration than a classifier, underscoring the practical value of calibration-aware model tuning.

Abstract

The assessment of binary classifier performance traditionally centers on discriminative ability, using metrics such as accuracy. However, these metrics often disregard the model's inherent uncertainty, especially in sensitive decision-making domains such as finance or healthcare. Given that model-predicted scores are commonly interpreted as event probabilities, calibration is crucial for accurate interpretation. In our study, we analyze the sensitivity of various calibration measures to score distortions and introduce a refined metric, the Local Calibration Score. Comparing recalibration methods, we advocate for local regressions, emphasizing their dual role as effective recalibration tools and facilitators of smoother visualizations. We apply these findings in a real-world scenario, using a Random Forest classifier and regressor to predict credit default while simultaneously measuring calibration during performance optimization.

Paper Structure (34 sections, 1 theorem, 22 equations, 20 figures, 1 table)

This paper contains 34 sections, 1 theorem, 22 equations, 20 figures, 1 table.

Key Result

Proposition 2.1

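Consider a dataset $\{(d_i,\mathbf{x}_i)\}$, where $\mathbf{x}$ are $k$ features ($k$ being fixed), so that $D|\boldsymbol{X}=\mathbf{x} \sim \mathcal{B}(s(\mathbf{x}))$, where $s(\mathbf{x}) = \left[1+\exp\left(-(\beta_0+\mathbf{x}^\top\boldsymbol{\beta})\right)\right]^{-1}$. Let $\widehat{\beta}_0$ and $\widehat{\boldsymbol{\beta}}$ denote maximum likelihood estimators. Then, for any $\mathbf{x}$, the score defined as $\widehat{s}(\mathbf{x}) = \left[1+\exp\left(-(\widehat{\beta}_0+\mathbf{x}^\top\widehat{\boldsymbol{\beta}})\right)\right]^{-1}$ is well-calibrated in the sense that $\widehat{s}(\mathbf{x}) \overset{\mathbb{P}}{\to} \mathbb{E}[D \mid \widehat{s}(\mathbf{x})]$ as $n\to\infty$.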
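
As a rough numerical illustration of Proposition 2.1, the sketch below simulates Bernoulli outcomes, fits the score by maximum likelihood, and measures calibration with a quantile-binned ECE. It assumes the score is a logistic regression. The sample size, the coefficient values, the 10 bins, the cubic distortion, and the helper names `fit_logistic_mle` and `quantile_ece` are illustrative assumptions, not taken from the paper, and this ECE is one common variant that may differ from the paper's exact definition.

```python
import numpy as np

def fit_logistic_mle(X, d, n_iter=25):
    """Maximum likelihood fit of a logistic regression via Newton-Raphson.

    Returns (beta0_hat, beta_hat). Hypothetical helper for illustration.
    """
    Z = np.column_stack([np.ones(len(X)), X])  # prepend intercept column
    theta = np.zeros(Z.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-Z @ theta))
        grad = Z.T @ (d - p)                        # score vector
        hess = (Z * (p * (1.0 - p))[:, None]).T @ Z  # observed information
        theta = theta + np.linalg.solve(hess, grad)
    return theta[0], theta[1:]

def quantile_ece(scores, d, n_bins=10):
    """Quantile-binned ECE: weighted mean of |mean score - event rate| per bin."""
    order = np.argsort(scores)
    bins = np.array_split(order, n_bins)  # equal-count bins
    return sum(
        len(b) / len(scores) * abs(scores[b].mean() - d[b].mean()) for b in bins
    )

# Simulated data with known probabilities (all values are assumptions).
rng = np.random.default_rng(0)
n, k = 20000, 3
beta0_true, beta_true = -0.5, np.array([1.0, -2.0, 0.5])
X = rng.normal(size=(n, k))
s_true = 1.0 / (1.0 + np.exp(-(beta0_true + X @ beta_true)))
d = rng.binomial(1, s_true).astype(float)

beta0_hat, beta_hat = fit_logistic_mle(X, d)
s_hat = 1.0 / (1.0 + np.exp(-(beta0_hat + X @ beta_hat)))

ece_fitted = quantile_ece(s_hat, d)          # MLE scores
ece_distorted = quantile_ece(s_hat ** 3, d)  # deliberately distorted scores
print(f"ECE (MLE scores):       {ece_fitted:.4f}")
print(f"ECE (distorted scores): {ece_distorted:.4f}")
```

Under these assumptions, the fitted scores should show a much smaller ECE than the distorted ones, even though both rank observations identically and therefore share the same AUC.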

Figures (20)

  • Figure 1: Distorted Probabilities as a Function of True Probabilities, Depending on the Value of $\alpha$ (left) or $\gamma$ (right).
  • Figure 2: Calibration Metrics on 200 Simulations for each Value of $\alpha$ (top) or $\gamma$ (bottom).
  • Figure 3: Calibration Curve Obtained with Local Regression, on 200 Simulations for each Value of $\alpha$ (top) or $\gamma$ (bottom). The distributions of the true probabilities are shown in the histograms (gold for $d=1$, purple for $d=0$).
  • Figure 4: Standard Goodness of Fit Metrics on 200 Simulations for each Value of $\alpha$ (top) or $\gamma$ (bottom). The probability threshold is set to $\tau=0.5$.
  • Figure 5: Metrics After Recalibration (for $\gamma=3$), on the Calibration (transparent colors) and on the Test Set (full colors).
  • ...and 15 more figures

Theorems & Definitions (2)

  • Proposition 2.1
  • proof