From Uncertainty to Precision: Enhancing Binary Classifier Performance through Calibration
Agathe Fernandes Machado, Arthur Charpentier, Emmanuel Flachaire, Ewen Gallic, François Hu
TL;DR
The paper tackles the misalignment between discriminative performance and probabilistic calibration in binary classification, arguing that well-calibrated scores are essential for decision-making in finance and healthcare. It introduces the Local Calibration Score (LCS), a calibration metric based on a smooth calibration curve derived from local regression, and compares it to traditional quantile-based metrics such as the Expected Calibration Error (ECE). Using synthetic data with known probabilities and a real-world credit-default dataset, the authors show that LCS tracks true miscalibration more accurately and that recalibration methods (Platt, isotonic, Beta, and local regression) improve calibration, sometimes at a small cost to discriminative ability (AUC). They further demonstrate that optimizing solely for AUC can degrade calibration, and that a regression-based Random Forest yields better calibration than a Random Forest classifier, underscoring the practical value of calibration-aware model tuning.
Abstract
The assessment of binary classifier performance traditionally centers on discriminative ability, using metrics such as accuracy. However, these metrics often disregard the model's inherent uncertainty, which matters especially in sensitive decision-making domains such as finance or healthcare. Given that model-predicted scores are commonly interpreted as event probabilities, calibration is crucial for accurate interpretation. In our study, we analyze the sensitivity of various calibration measures to score distortions and introduce a refined metric, the Local Calibration Score. Comparing recalibration methods, we advocate for local regressions, emphasizing their dual role as effective recalibration tools and facilitators of smoother visualizations. We apply these findings in a real-world scenario, using a Random Forest classifier and regressor to predict credit default while simultaneously measuring calibration during performance optimization.
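To make the idea of a score distortion concrete, the following is a minimal sketch, not the authors' implementation. It computes a quantile-binned Expected Calibration Error on synthetic data whose true probabilities are known. The function name `quantile_ece`, the choice of 10 bins, the uniform distribution of true probabilities, and the cubic distortion are all illustrative assumptions; the Local Calibration Score itself is not reproduced here.

```python
import numpy as np

def quantile_ece(y_true, scores, n_bins=10):
    """Quantile-binned ECE (illustrative helper, not from the paper).

    Observations are sorted by score and split into n_bins groups of
    (nearly) equal size; the ECE is the size-weighted mean absolute gap
    between the observed event rate and the mean score in each group.
    """
    y_true = np.asarray(y_true, dtype=float)
    scores = np.asarray(scores, dtype=float)
    order = np.argsort(scores)
    n = len(scores)
    ece = 0.0
    for idx in np.array_split(order, n_bins):
        if len(idx) == 0:
            continue
        gap = abs(y_true[idx].mean() - scores[idx].mean())
        ece += len(idx) / n * gap
    return ece

# Synthetic data with known probabilities (assumed uniform on [0, 1]).
rng = np.random.default_rng(0)
p_true = rng.uniform(0.0, 1.0, size=20_000)
y = rng.binomial(1, p_true)

# Assumed distortion: a monotone transform, so the ranking of the scores
# (and hence the AUC) is unchanged while the probabilities are no longer
# calibrated.
distorted = p_true ** 3

ece_true = quantile_ece(y, p_true)
ece_distorted = quantile_ece(y, distorted)
print(f"ECE with true probabilities: {ece_true:.4f}")
print(f"ECE with distorted scores:   {ece_distorted:.4f}")
```

Because the distortion is monotone, a purely discriminative metric such as AUC cannot distinguish the two score vectors, whereas the calibration measure should be clearly larger for the distorted scores.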
