Decision from Suboptimal Classifiers: Excess Risk Pre- and Post-Calibration
Alexandre Perez-Lebel, Gael Varoquaux, Sanmi Koyejo, Matthieu Doutreligne, Marine Le Morvan
TL;DR
This work quantifies how suboptimal probability estimates degrade expected utility in batch binary decision-making. It derives analytical expressions for the miscalibration-induced regret $R^{\mathrm{CL}}$ and tight bounds on the grouping-loss regret $R^{\mathrm{GL}}$, giving a decomposition $R_{f,t} = R^{\mathrm{CL}}_{f,t} + R^{\mathrm{GL}}_{f}$ that identifies when recalibration suffices and when post-training beyond recalibration is needed. The authors introduce GLAR to reduce grouping loss and show that both terms can be estimated in practice: $R^{\mathrm{CL}}$ from a calibration curve and $R^{\mathrm{GL}}$ from a grouping-loss estimator. Experiments on NLP hate-speech tasks show that these regret quantities predict the utility gains from recalibration and post-training better than traditional metrics. The authors also advocate multicalibration as a cost-effective alternative to fine-tuning. Overall, the paper provides a practical decision-validation framework that indicates when cheap recalibration is enough and when more advanced post-training yields meaningful improvements.
Abstract
Probabilistic classifiers are central to making informed decisions under uncertainty. Based on the maximum expected utility principle, optimal decision rules can be derived using the posterior class probabilities and misclassification costs. Yet, in practice, only learned approximations of the oracle posterior probabilities are available. In this work, we quantify the excess risk (a.k.a. regret) incurred by using approximate posterior probabilities in batch binary decision-making. We provide analytical expressions for miscalibration-induced regret ($R^{\mathrm{CL}}$), as well as tight and informative upper and lower bounds on the regret of calibrated classifiers ($R^{\mathrm{GL}}$). These expressions allow us to identify regimes where recalibration alone addresses most of the regret, and regimes where the regret is dominated by the grouping loss, which calls for post-training beyond recalibration. Crucially, both $R^{\mathrm{CL}}$ and $R^{\mathrm{GL}}$ can be estimated in practice using a calibration curve and a recent grouping loss estimator. In NLP experiments, we show that these quantities identify when the expected gain of more advanced post-training is worth the operational cost. Finally, we highlight the potential of multicalibration approaches as efficient alternatives to costlier fine-tuning approaches.
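To make the setting concrete, the sketch below illustrates the standard cost-sensitive decision rule and the regret of thresholding approximate probabilities instead of the oracle posterior. It is an illustration only, not the paper's estimators of $R^{\mathrm{CL}}$ or $R^{\mathrm{GL}}$. It assumes that correct decisions cost nothing and that the oracle posterior is known, which holds only in simulation. The names `cost_fp`, `cost_fn`, `p_true`, and `p_est` are hypothetical.

```python
import numpy as np


def optimal_threshold(cost_fp, cost_fn):
    """Threshold on P(Y=1|X) that minimizes expected cost.

    Assumes zero cost for correct decisions: predicting 1 is preferable
    when cost_fp * (1 - p) <= cost_fn * p, i.e. p >= cost_fp / (cost_fp + cost_fn).
    """
    return cost_fp / (cost_fp + cost_fn)


def expected_cost(decisions, p_true, cost_fp, cost_fn):
    """Average expected cost of binary decisions under the oracle posterior."""
    decisions = np.asarray(decisions, dtype=bool)
    p_true = np.asarray(p_true, dtype=float)
    # Deciding 1 incurs cost_fp when Y=0; deciding 0 incurs cost_fn when Y=1.
    per_sample = np.where(decisions, cost_fp * (1.0 - p_true), cost_fn * p_true)
    return per_sample.mean()


def regret(p_est, p_true, cost_fp, cost_fn):
    """Excess risk of thresholding p_est rather than the oracle p_true."""
    t = optimal_threshold(cost_fp, cost_fn)
    risk_est = expected_cost(np.asarray(p_est) >= t, p_true, cost_fp, cost_fn)
    risk_oracle = expected_cost(np.asarray(p_true) >= t, p_true, cost_fp, cost_fn)
    return risk_est - risk_oracle


if __name__ == "__main__":
    # Toy oracle posteriors and a distorted (under-confident for class 1) estimate.
    p_true = np.array([0.1, 0.3, 0.6, 0.9])
    p_est = p_true ** 2
    print(regret(p_est, p_true, cost_fp=1.0, cost_fn=1.0))
```

With equal costs the threshold is 0.5, and the distorted estimate flips the decision only for the sample whose oracle posterior is 0.6, so the regret comes entirely from that sample. On real data the oracle posterior is unavailable, which is why the regret has to be estimated from quantities such as a calibration curve.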
