Probabilistic Scores of Classifiers, Calibration is not Enough

Agathe Fernandes Machado; Arthur Charpentier; Emmanuel Flachaire; Ewen Gallic; François Hu

TL;DR

This work shows that calibration metrics alone can mislead when the distribution of a classifier's scores does not match the distribution of the true probabilities. It proposes minimizing the KL divergence between the predicted score distribution and the true probability distribution as a model-selection objective, particularly for tree-based methods such as Random Forest and XGBoost. Across synthetic data-generating processes (DGPs) and ten real-world UCI datasets, where a Beta distribution serves as the prior for the true probabilities, KL-based tuning aligns scores with true probabilities substantially better at the cost of only a small loss in discrimination, and sometimes markedly improves how well the scores represent the underlying probabilities. The findings suggest that in decision contexts requiring accurate probability estimation, KL-based model selection offers practical advantages over conventional metrics such as AUC, the Brier score (BS), or the Integrated Calibration Index (ICI), especially when prior information about the probability distribution is available.

Abstract

In binary classification tasks, accurate representation of probabilistic predictions is essential for various real-world applications such as predicting payment defaults or assessing medical risks. The model must then be well-calibrated to ensure alignment between predicted probabilities and actual outcomes. However, when score heterogeneity deviates from the underlying data probability distribution, traditional calibration metrics lose reliability, failing to align score distribution with actual probabilities. In this study, we highlight approaches that prioritize optimizing the alignment between predicted scores and true probability distributions over minimizing traditional performance or calibration metrics. When employing tree-based models such as Random Forest and XGBoost, our analysis emphasizes the flexibility these models offer in tuning hyperparameters to minimize the Kullback-Leibler (KL) divergence between predicted and true distributions. Through extensive empirical analysis across 10 UCI datasets and simulations, we demonstrate that optimizing tree-based models based on KL divergence yields superior alignment between predicted scores and actual probabilities without significant performance loss. In real-world scenarios, the reference probability is determined a priori as a Beta distribution estimated through maximum likelihood. Conversely, minimizing traditional calibration metrics may lead to suboptimal results, characterized by notable performance declines and inferior KL values. Our findings reveal limitations in traditional calibration metrics, which could undermine the reliability of predictive models for critical decision-making.
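
The comparison between a score distribution and a reference probability distribution can be illustrated with a minimal sketch. It is not the authors' implementation: the histogram binning on [0, 1], the smoothing constant `eps`, the direction KL(reference || scores), and the Beta(2, 5) reference are all assumptions chosen for illustration, and the helper names are hypothetical. The reference distribution is represented here by a sample, which in a synthetic setting could be the known true probabilities.

```python
import math
import random


def histogram(values, n_bins=20):
    """Bin proportions of `values` over n_bins equal-width bins on [0, 1]."""
    counts = [0] * n_bins
    for v in values:
        # Clamp the index so that v == 1.0 lands in the last bin.
        idx = max(0, min(int(v * n_bins), n_bins - 1))
        counts[idx] += 1
    total = len(values)
    return [c / total for c in counts]


def kl_divergence(p, q, eps=1e-9):
    """Discrete KL(p || q); eps-smoothing avoids log(0) (an assumption)."""
    p_s = [pi + eps for pi in p]
    q_s = [qi + eps for qi in q]
    zp, zq = sum(p_s), sum(q_s)
    return sum(
        (pi / zp) * math.log((pi / zp) / (qi / zq))
        for pi, qi in zip(p_s, q_s)
    )


def kl_scores_vs_reference(scores, reference, n_bins=20):
    """KL(reference || scores), both summarized as histograms on [0, 1]."""
    return kl_divergence(histogram(reference, n_bins), histogram(scores, n_bins))


rng = random.Random(0)

# Hypothetical reference: true probabilities drawn from a Beta(2, 5) prior.
reference = [rng.betavariate(2, 5) for _ in range(5000)]

# Scores whose distribution matches the reference.
scores_matched = [rng.betavariate(2, 5) for _ in range(5000)]

# A constant score equal to the mean probability: calibrated on average,
# but its distribution is degenerate.
base_rate = sum(reference) / len(reference)
scores_constant = [base_rate] * 5000

kl_good = kl_scores_vs_reference(scores_matched, reference)
kl_flat = kl_scores_vs_reference(scores_constant, reference)
print(f"KL, matched scores:  {kl_good:.4f}")
print(f"KL, constant scores: {kl_flat:.4f}")
```

The constant predictor is the extreme case of the problem described above: it reproduces the base rate, yet its score distribution carries no heterogeneity, so its divergence from the reference is far larger than that of scores drawn from the matching distribution.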

Paper Structure (45 sections, 3 theorems, 16 equations, 26 figures, 17 tables, 2 algorithms)

Introduction
Related Work and Metrics
Contributions and Findings
Calibration of a Binary Classifier
Calibration of Well-Specified Logistic Regression
Calibration Curve
Calibration Metrics
Scores Heterogeneity
Decomposition
Synthetic Data
Illustrative Example: Regression Trees
Replications
Simulations
Synthetic Data
Real-World Data
...and 30 more sections

Key Result

Proposition 2.1

Consider a dataset $\{(y_i,\mathbf{x}_i)\}$, where $\mathbf{x}$ are $k$ features ($k$ being fixed), so that $Y|\boldsymbol{X}=\mathbf{x} \sim \mathcal{B}(s(\mathbf{x}))$ where $s(\mathbf{x})=[1+\exp[-(\beta_0+\mathbf{x}^\top\boldsymbol{\beta})]]^{-1}$. Let $\widehat{\beta}_0$ and …
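
The data-generating model in this setup can be sketched as follows, assuming the standard logistic form $s(\mathbf{x}) = 1/(1+\exp[-(\beta_0+\mathbf{x}^\top\boldsymbol{\beta})])$ and Bernoulli outcomes. The coefficient values, the Gaussian features, and the function names are hypothetical choices for illustration only; the sketch covers the model setup, not the conclusion of the proposition.

```python
import math
import random


def s(x, beta0, beta):
    """Logistic probability s(x) = 1 / (1 + exp(-(beta0 + x . beta)))."""
    z = beta0 + sum(xj * bj for xj, bj in zip(x, beta))
    return 1.0 / (1.0 + math.exp(-z))


def simulate(n, beta0, beta, rng):
    """Draw n pairs (y_i, x_i) with Y | X = x ~ Bernoulli(s(x))."""
    data = []
    for _ in range(n):
        # Hypothetical features: independent standard Gaussians.
        x = [rng.gauss(0.0, 1.0) for _ in beta]
        p = s(x, beta0, beta)
        y = 1 if rng.random() < p else 0
        data.append((y, x, p))
    return data


rng = random.Random(0)
beta0, beta = -0.5, [1.0, -2.0]  # hypothetical coefficients, k = 2
sample = simulate(20000, beta0, beta, rng)

mean_y = sum(y for y, _, _ in sample) / len(sample)
mean_p = sum(p for _, _, p in sample) / len(sample)
print(f"mean outcome: {mean_y:.3f}, mean true probability: {mean_p:.3f}")
```

Because each outcome is drawn with probability $s(\mathbf{x})$, the average outcome and the average true probability agree up to sampling noise.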

Figures (26)

Figure 1: Distribution of the underlying probabilities in the different categories of scenarios.
Figure 2: Distribution of scores on test set for trees (DGP 1, without noise variables).
Figure 3: KL Divergence and Calibration (DGP 1) across increasing average number of tree leaves. Dashed lines represent values when maximizing AUC. Arrows indicate increasing number of leaves.
Figure C1: Distribution of true probabilities and estimated scores for Trees under DGP 1: single replication across various numbers of noise variables.
Figure C2: Distribution of true probabilities and estimated scores for Trees under DGP 2: single replication across various numbers of noise variables.
...and 21 more figures

Theorems & Definitions (4)

Proposition 2.1
proof
Lemma 3.1: Adapted from brocker2009reliability
Lemma 3.2: Adapted from kull2015novel
