Calibrating Where It Matters: Constrained Temperature Scaling

Stephen McKenna, Jacob Carse

TL;DR

The paper addresses calibration of deep classifiers for medical decision support when deployment cost functions are unknown but bounded. It extends post-hoc temperature scaling with a constrained temperature $T^*$ that emphasizes calibration near decision boundaries by restricting estimation to clinically relevant subsets, in both binary and multi-class settings. Experiments on ISIC 2019 dermoscopy data with EfficientNet B7 and ResNet101 show that standard temperature scaling improves calibration and that the constrained $T^*$ yields further gains in boundary regions, particularly for $p<0.5$ in the binary setting and for benign classes in the multi-class setting. The results suggest that targeted calibration can improve decision quality in clinical settings and can complement other calibration strategies and priors; future work would extend to more datasets and to robustness under distribution shift.

Abstract

We consider calibration of convolutional classifiers for diagnostic decision making. Clinical decision makers can use calibrated classifiers to minimise expected costs given their own cost function. Such functions are usually unknown at training time. If minimising expected costs is the primary aim, algorithms should focus on tuning calibration in regions of the probability simplex likely to affect decisions. We give an example, modifying temperature scaling calibration, and demonstrate improved calibration where it matters using convnets trained to classify dermoscopy images.
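
The idea of fitting a temperature only where decisions are likely to change can be illustrated with a minimal binary sketch. It is not the authors' implementation: the band of uncalibrated probabilities (`p_low`, `p_high`), the grid search, the synthetic data, and all function names are assumptions made for illustration. Standard temperature scaling minimises negative log-likelihood over all held-out examples, while the constrained variant here minimises it only over examples inside the chosen band.

```python
import numpy as np


def binary_nll(logits, labels, temperature):
    """Mean negative log-likelihood after dividing logits by the temperature."""
    z = logits / temperature
    # -[y log s(z) + (1 - y) log(1 - s(z))] = log(1 + e^z) - y z, computed stably.
    return float(np.mean(np.logaddexp(0.0, z) - labels * z))


def fit_temperature(logits, labels, mask=None, grid=None):
    """Grid-search the temperature minimising NLL, optionally on a subset only."""
    if grid is None:
        # Log-spaced candidate temperatures (an arbitrary illustrative range).
        grid = np.exp(np.linspace(np.log(0.05), np.log(20.0), 400))
    if mask is None:
        mask = np.ones(logits.shape, dtype=bool)
    if not mask.any():
        raise ValueError("mask selects no calibration examples")
    losses = [binary_nll(logits[mask], labels[mask], t) for t in grid]
    return float(grid[int(np.argmin(losses))])


def probability_band(logits, p_low, p_high):
    """Select examples whose uncalibrated probability lies in [p_low, p_high].

    The band is a hypothetical stand-in for a decision-relevant region.
    """
    p = 1.0 / (1.0 + np.exp(-logits))
    return (p >= p_low) & (p <= p_high)


# Synthetic, overconfident classifier: its logits are 2.5x too large.
rng = np.random.default_rng(0)
true_logits = rng.normal(0.0, 1.5, size=20000)
labels = (rng.random(20000) < 1.0 / (1.0 + np.exp(-true_logits))).astype(float)
logits = 2.5 * true_logits

t_all = fit_temperature(logits, labels)              # standard temperature scaling
band = probability_band(logits, 0.05, 0.5)           # assumed region of interest
t_star = fit_temperature(logits, labels, mask=band)  # constrained temperature
print(f"T (all data) = {t_all:.2f}, T* (band only) = {t_star:.2f}")
```

On this synthetic data the two temperatures should be similar, because a single temperature fully corrects the miscalibration. On a real network, where miscalibration can differ across the probability range, the constrained fit may trade calibration elsewhere for calibration inside the band. A multi-class version would replace the sigmoid with a softmax and the band with a region of the probability simplex.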

Paper Structure (4 sections, 1 figure, 2 tables)

Figures (1)

  • Figure 1: Reliability diagrams for binary EfficientNet B7, without calibration (left) and with modified temperature scaling (right). Vertical bars indicate $5\%$ to $95\%$ bootstrap consistency intervals (Bröcker and Smith, 2007).