
Decision from Suboptimal Classifiers: Excess Risk Pre- and Post-Calibration

Alexandre Perez-Lebel, Gael Varoquaux, Sanmi Koyejo, Matthieu Doutreligne, Marine Le Morvan

TL;DR

This work addresses how suboptimal probability estimates used in batch binary decisions degrade expected utility. It derives analytical expressions for calibration regret $R^{\mathrm{CL}}$ and tight bounds for grouping-loss regret $R^{\mathrm{GL}}$, enabling a decomposition $R_{f,t} = R^{\mathrm{CL}}_{f,t} + R^{\mathrm{GL}}_{f}$ that identifies when recalibration suffices and when post-training is beneficial. The authors introduce GLAR to reduce grouping loss and show that both $R^{\mathrm{CL}}$ and $R^{\mathrm{GL}}$ are practically estimable via calibration curves and the grouping-loss estimator, respectively. Experiments on NLP hate-speech tasks demonstrate that these regret quantities better predict the utility gains from recalibration and post-training than traditional metrics, and they advocate multicalibration as a cost-effective alternative to fine-tuning. Overall, the paper provides a practical decision-validation framework to guide post-training and highlights when cheaper recalibration or advanced post-training yields meaningful improvements.

Abstract

Probabilistic classifiers are central for making informed decisions under uncertainty. Based on the maximum expected utility principle, optimal decision rules can be derived using the posterior class probabilities and misclassification costs. Yet, in practice only learned approximations of the oracle posterior probabilities are available. In this work, we quantify the excess risk (a.k.a. regret) incurred using approximate posterior probabilities in batch binary decision-making. We provide analytical expressions for miscalibration-induced regret ($R^{\mathrm{CL}}$), as well as tight and informative upper and lower bounds on the regret of calibrated classifiers ($R^{\mathrm{GL}}$). These expressions allow us to identify regimes where recalibration alone addresses most of the regret, and regimes where the regret is dominated by the grouping loss, which calls for post-training beyond recalibration. Crucially, both $R^{\mathrm{CL}}$ and $R^{\mathrm{GL}}$ can be estimated in practice using a calibration curve and a recent grouping loss estimator. On NLP experiments, we show that these quantities identify when the expected gain of more advanced post-training is worth the operational cost. Finally, we highlight the potential of multicalibration approaches as efficient alternatives to costlier fine-tuning approaches.
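As a minimal sketch of the maximum expected utility rule mentioned in the abstract, the snippet below derives a decision threshold from a 2x2 utility table and applies it to a posterior probability. The utility names (`u_tp`, `u_fp`, `u_tn`, `u_fn`), the function names, and the example values are illustrative assumptions, not notation or numbers from the paper; the closed form simply follows from comparing the expected utility of the two possible decisions.

```python
def optimal_threshold(u_tp, u_fp, u_tn, u_fn):
    """Threshold t* on P(Y=1 | x) above which deciding 1 has the higher expected utility.

    Deciding 1 is preferred when p*u_tp + (1-p)*u_fp >= p*u_fn + (1-p)*u_tn,
    which rearranges to p >= (u_tn - u_fp) / ((u_tp - u_fn) + (u_tn - u_fp)).
    """
    gain_pos = u_tp - u_fn  # benefit of deciding 1 on a true positive
    gain_neg = u_tn - u_fp  # benefit of deciding 0 on a true negative
    if gain_pos <= 0 or gain_neg <= 0:
        raise ValueError("correct decisions must have strictly higher utility")
    return gain_neg / (gain_pos + gain_neg)


def expected_utility(p, decision, u_tp, u_fp, u_tn, u_fn):
    """Expected utility of a decision when the positive class has probability p."""
    if decision == 1:
        return p * u_tp + (1 - p) * u_fp
    return p * u_fn + (1 - p) * u_tn


def decide(p, t):
    """Threshold rule: decide 1 when the probability reaches the threshold."""
    return int(p >= t)


# Hypothetical costs: a false positive is four times as costly as a true positive is valuable.
U = dict(u_tp=1.0, u_fp=-4.0, u_tn=0.0, u_fn=0.0)
t_star = optimal_threshold(**U)
```

With oracle posteriors this rule is optimal; the regret studied in the paper arises when `p` is replaced by a learned estimate.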

Paper Structure

This paper contains 71 sections, 22 theorems, 76 equations, 20 figures.

Key Result

Proposition 3.0

Let $\mathcal{D}_{f}$ be the set of decision rules that are functions of the estimated probabilities $f$. Then the calibrated probabilities thresholded at $t^\star$ maximize the conditional expected utility over $\mathcal{D}_{f}$, i.e.,

Figures (20)

  • Figure 1: Impact of miscalibration on the regret $R^\mathrm{CL}_{f\!,t}$. (a) The oracle decision $p \mapsto \mathds{1}_{p \geq t^{\star}}$ applied to miscalibrated estimated probabilities $f$, that is $\delta_{f, t^{\star}}$, yields a nonzero regret $R^\mathrm{CL}_{{f}\!,t^{\star}}$ within the areas of disagreement, of magnitude $|c - t^{\star}|$ (red area). The regret $R^\mathrm{CL}_{f\!,t}$ can be reduced to 0 either (b) by moving the decision threshold to $t_f = c^{-1}(t^{\star})$, that is $\delta_{f, t_f}$, or (c) by recalibrating the estimated probabilities and keeping $t^{\star}$ as the threshold, that is $\delta_{c\circ\!f, t^{\star}}$ (see the proposition on adjusting the threshold).
  • Figure 2: Impact of the grouping loss on the minimal regret $L^\mathrm{GL}_f$ of a recalibrated classifier. In a bin $p \in \operatorname{supp}f(X)$, any grouping loss in excess of $V_{\!\mathrm{min}}(p)$ imposes on the recalibrated classifier a nonzero regret $R^\mathrm{GL}_f(p)$ of at least $U_{\!\!\Delta} \left[\mathrm{GL}(p)-V_{\!\mathrm{min}}(p)\right]_{+}$ (Theorem 3.1, grouping regret lower bound). Measuring a variance smaller than $V_{\!\mathrm{min}}(p)$ is not informative about the grouping regret, since there exists a pair $(f^{\star}, f)$ for which $\mathrm{GL}(p) = V_{\!\mathrm{min}}(p)$ and $R^\mathrm{GL}_f(p) = 0$. The variance cannot exceed $V_{\!\mathrm{max}}(p) \triangleq c(p)(1 - c(p))$. The informative area is highlighted in green.
  • Figure 3: Impact of the grouping loss on the bounds of the regret of the recalibrated classifier. Lower and upper bounds $L^\mathrm{GL}_f(p)$ and $U^\mathrm{GL}_f(p)$ as a function of the calibrated probabilities $c(p)\in[0, 1]$ for a bin $p \in \operatorname{supp}f(X)$, in three settings of grouping loss: maximal (a), intermediate (b) and small (c). The gap between the lower and upper bounds narrows when the grouping loss is either high or low. Here $V_{\!\mathrm{max}} \triangleq c(p)(1 - c(p))$.
  • Figure 4: ${\hat{R}}^\mathrm{CL}_{f\!,t}$ captures the gain of recalibration. (a) Gain in utility of isotonic recalibration versus the regret with respect to the recalibrated classifier ${\hat{R}}^\mathrm{CL}_{{f}\!,t^{\star}}$, for each (model, dataset, $t^{\star}$). (b) Pearson's $r^2$ correlation of the gain in utility of each recalibration method with ${\hat{R}}^\mathrm{CL}_{{f}\!,t^{\star}}$ and other metrics.
  • Figure 5: Gain of Post-Training on top of recalibration. (a) Gain in utility of fine-tuning over isotonic recalibration versus $\hat{R}^{\mathrm{GL}}_{{f}}$, for each (model, dataset, $t^{\star}$). (b) Pearson's $r^2$ correlation of the gain in utility over isotonic recalibration with $\hat{R}^{\mathrm{GL}}_{{f}}$, ${\hat{R}}^\mathrm{CL}_{{f}\!,t^{\star}}$, and other metrics.
  • ...and 15 more figures
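The two remedies described in the Figure 1 caption can be sketched numerically. The snippet below is a toy illustration under stated assumptions: the calibration curve `c(p) = p**2`, the score grid, and the helper names are hypothetical, and `disagreement_gap` only averages $|c - t^{\star}|$ over the region where a rule disagrees with the best rule based on calibrated probabilities. It omits any utility scaling, so it is a proxy for, not the paper's expression of, $R^\mathrm{CL}_{f\!,t}$.

```python
def calibration_curve(p):
    # Hypothetical increasing calibration curve c(p) = P(Y=1 | f(X)=p)
    # for an overconfident score (true rate below the reported score).
    return p ** 2


def invert_monotone(c, target, lo=0.0, hi=1.0, iters=60):
    """Bisection for the smallest p with c(p) >= target, assuming c is increasing."""
    for _ in range(iters):
        mid = (lo + hi) / 2
        if c(mid) < target:
            lo = mid
        else:
            hi = mid
    return hi


def disagreement_gap(scores, rule, c, t_star):
    """Average of |c(p) - t*| over scores where `rule` differs from thresholding c(p) at t*."""
    total = 0.0
    for p in scores:
        best = int(c(p) >= t_star)
        if rule(p) != best:
            total += abs(c(p) - t_star)
    return total / len(scores)


t_star = 0.5
scores = [i / 100 for i in range(101)]  # toy grid of estimated probabilities
t_f = invert_monotone(calibration_curve, t_star)  # adjusted threshold c^{-1}(t*)

# (a) oracle threshold applied to miscalibrated scores
gap_naive = disagreement_gap(scores, lambda p: int(p >= t_star), calibration_curve, t_star)
# (b) adjusted threshold applied to the raw scores
gap_adjusted = disagreement_gap(scores, lambda p: int(p >= t_f), calibration_curve, t_star)
# (c) oracle threshold applied to recalibrated scores
gap_recal = disagreement_gap(
    scores, lambda p: int(calibration_curve(p) >= t_star), calibration_curve, t_star
)
```

In this toy setting, rules (b) and (c) make identical decisions because the assumed curve is strictly increasing; a calibration curve that is not monotone would not admit a single adjusted threshold in this way.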

Theorems & Definitions (40)

  • Proposition 3.0: Best decision given estimated probabilities (proof: sec:proof:best-decision:dh)
  • Proposition 3.0: Expression of the calibration regret (proof: sec:proof:rcl:expression)
  • Proposition 3.0: Adjusting the threshold $t_f$ (proof: sec:proof:th)
  • Theorem 3.1: Grouping regret lower bound (proof: proof:regret:lb)
  • Theorem 3.2: Grouping regret upper bound (proof: proof:regret:ub)
  • Definition 3.3: Regret estimators
  • Definition 3.4: GLAR
  • Lemma B.1: Parametrization of decision rules
  • Proof of Lemma B.1
  • Lemma B.2: $\mathcal{D}_{f}$
  • ...and 30 more