Multiclass Calibration Assessment and Recalibration of Probability Predictions via the Linear Log Odds Calibration Function

Amy Vennos; Xin Xing; Christopher T. Franck

TL;DR

The paper proposes Multicategory Linear Log Odds (MCLLO) recalibration. The method includes a likelihood ratio hypothesis test to assess calibration, needs no under-the-hood access to the model (so it applies to a wide range of classification problems), and yields easily interpreted output.

Abstract

Machine-generated probability predictions are essential in modern classification tasks such as image classification. A model is well calibrated when its predicted probabilities correspond to observed event frequencies. Despite the need for multicategory recalibration methods, existing methods are limited in three ways: they (i) compare calibration between two or more models rather than directly assessing the calibration of a single model, (ii) require under-the-hood model access, e.g., access to logit-scale predictions within the layers of a neural network, and (iii) provide output that is difficult for human analysts to understand. To overcome (i)-(iii), we propose Multicategory Linear Log Odds (MCLLO) recalibration, which (i) includes a likelihood ratio hypothesis test to assess calibration, (ii) does not require under-the-hood access to models and is thus applicable to a wide range of classification problems, and (iii) can be easily interpreted. We demonstrate the effectiveness of the MCLLO method through simulations and three real-world case studies involving image classification via convolutional neural network, obesity analysis via random forest, and ecology via regression modeling. We compare MCLLO to four comparator recalibration techniques, using both our hypothesis test and the existing calibration metric Expected Calibration Error (ECE), to show that our method works well alone and in concert with other methods.
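
To make the Expected Calibration Error mentioned in the abstract concrete, here is a minimal sketch of the common top-label ECE with equal-width confidence bins. The function name, the choice of 10 bins, and the half-open binning convention are assumptions made for illustration; the paper's exact ECE configuration is not stated in this summary.

```python
def expected_calibration_error(probs, labels, n_bins=10):
    """Top-label ECE: weighted gap between accuracy and mean confidence per bin.

    probs  : list of probability vectors, one per observation (illustrative input)
    labels : list of observed class indices
    n_bins : number of equal-width confidence bins (assumed default, not from the paper)
    """
    n = len(probs)
    conf_sum = [0.0] * n_bins   # sum of top-label confidences in each bin
    hit_sum = [0] * n_bins      # number of correct top-label predictions in each bin
    for p, y in zip(probs, labels):
        pred = max(range(len(p)), key=p.__getitem__)  # index of the top label
        conf = p[pred]
        b = min(int(conf * n_bins), n_bins - 1)       # bins are [k/n_bins, (k+1)/n_bins)
        conf_sum[b] += conf
        hit_sum[b] += int(pred == y)
    # (m/n) * |acc - mean_conf| simplifies to |hits - conf_sum| / n for each bin
    return sum(abs(h - c) for h, c in zip(hit_sum, conf_sum)) / n


# Overconfident toy predictions: 90% confidence but only 50% accuracy.
overconfident = [[0.9, 0.1]] * 10
observed = [0] * 5 + [1] * 5
print(expected_calibration_error(overconfident, observed))  # about 0.4, i.e. |0.5 - 0.9|
```

A value near zero indicates that, within each confidence bin, the top-label confidence matches the observed accuracy. ECE summarizes calibration as a single number, whereas the abstract's likelihood ratio test gives a formal accept/reject assessment.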

Paper Structure (25 sections, 4 theorems, 16 equations, 6 figures)

Introduction
Multicategory Linear Log Odds Methodology
Analytic Form of Recalibrated Probability Predictions
MCLLO Likelihood
Recalibration of Probability Predictions
Likelihood Ratio Test
Theoretical Properties
Simulation Study
Simulation Study Results: LRT
Simulation Study: ECE
Case Studies
Comparator Methods
Case Study: Image Classification
Calibration Assessment for Image Classification
Image Classification: MCLLO Recalibration
...and 10 more sections

Key Result

Lemma 1

The negative log of the MCLLO likelihood in Equation (5) is convex in $\bm{\tau} = \log \bm{\delta}$ and $\bm{\gamma}$.
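
Equation (5) is not reproduced in this summary, so the following is only a sketch of why a result like Lemma 1 is plausible. It assumes that MCLLO uses a baseline-category form in which the recalibrated log odds are linear in the original log odds; the symbols $z_{ij}$, $y_i$, and the baseline index $c$ are notation introduced here for illustration.

```latex
% Assumed form (not taken from the source): for observation i, class j, baseline class c,
%   \log\frac{x^*_{ij}}{x^*_{ic}} = \log\delta_j + \gamma_j \log\frac{x_{ij}}{x_{ic}},
% with \delta_c = 1 and \gamma_c = 1 fixed for identifiability.
\begin{align*}
z_{ij}(\bm{\tau}, \bm{\gamma}) &= \tau_j + \gamma_j \log\frac{x_{ij}}{x_{ic}}, \qquad z_{ic} = 0,
  \qquad \tau_j = \log \delta_j, \\
x^*_{ij} &= \frac{\exp(z_{ij})}{\sum_{k} \exp(z_{ik})}, \\
-\ell(\bm{\tau}, \bm{\gamma}) &= \sum_{i=1}^{n} \left[ \log \sum_{k} \exp(z_{ik}) - z_{i y_i} \right].
\end{align*}
% Each z_{ij} is affine in (\bm{\tau}, \bm{\gamma}); log-sum-exp is convex, and a convex
% function composed with an affine map stays convex. Subtracting the affine term z_{i y_i}
% and summing over i preserves convexity, so -\ell is convex in (\bm{\tau}, \bm{\gamma}).
```

Under this assumed form, the reparameterization $\bm{\tau} = \log \bm{\delta}$ is what makes the argument work: the logits are affine in $\bm{\tau}$ but not in $\bm{\delta}$ itself.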

Figures (6)

Figure 1: Two images with observed label "plane" from the CIFAR-10 data set. A neural net output confidence scores over a set of ten labels for each image and reported high confidence for the label "plane" for both images. However, the image on the right is less recognizable as a plane than the image on the left. Direct recalibration via MCLLO, as described in Section \ref{sec:Methods}, adjusts the probability of each label to correspond better with the rates at which planes occur given data such as these images.
Figure 2: Thirty simulations of 1000 Monte Carlo repetitions were run with varying sample sizes and effect sizes. The rejection rates for each simulation are tabulated in the supplementary material and visualized in this figure, showing the effect of sample size and effect size on the power of our LRT.
Figure 3: Five simulations of 1000 Monte Carlo repetitions (with sample size $n=5000$) were run with varying effect sizes. The ECE values of the original probability predictions, $ECE(\mathbf{X})$, are plotted in navy for each repetition, and the ECE values of the MCLLO-recalibrated probability predictions, $ECE(\mathbf{X}^*)$, are plotted in green.
Figure 4: Twenty simulations of 1000 Monte Carlo repetitions were run with varying sample sizes and effect sizes. The mean difference in ECE scores, $ECE(\mathbf{X}) - ECE(\mathbf{X}^*)$, is visualized for each simulation.
Figure 5: Reliability diagrams for the CIFAR holdout confidence scores before [top left] and after different methods of recalibration, where well-calibrated predictions lie along the $x=y$ line. The top row shows (left to right) the original uncalibrated holdout confidence scores $\mathbf{X}_h$, MCLLO-recalibrated confidence scores $\mathbf{X}_{h,\text{MCLLO}}^*$, and temperature scaled confidence scores $\mathbf{X}_{h, TS}^*$. The bottom row shows (left to right) vector scaled confidence scores $\mathbf{X}_{h,VS}^*$, confidence scores obtained through the extension of binning as described by guo2017calibration, denoted $\mathbf{X}_{h, EB}^*$, and those obtained via the recalibration method by xudavoine, denoted $\mathbf{X}_{h,Xu}^*$.
...and 1 more figure
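
The caption of Figure 1 describes recalibration as adjusting each label's probability. A minimal sketch of such a mapping follows, assuming a baseline-category linear log odds form with a shift parameter `delta` and a scale parameter `gamma` per non-baseline class. The function name, the choice of the last class as baseline, and the exact parameterization are assumptions for illustration, not the paper's stated formula.

```python
import math


def mcllo_recalibrate(x, delta, gamma):
    """Recalibrate one probability vector under an ASSUMED linear log odds form.

    x     : probabilities for c classes, all strictly positive, summing to 1
    delta : c-1 positive shift parameters (the last class is the baseline)
    gamma : c-1 scale parameters
    Assumed model: log(x*_j / x*_c) = log(delta_j) + gamma_j * log(x_j / x_c).
    """
    base = x[-1]
    logits = [math.log(d) + g * math.log(xj / base)
              for xj, d, g in zip(x[:-1], delta, gamma)]
    logits.append(0.0)                         # baseline class has log odds 0
    m = max(logits)                            # subtract max for numerical stability
    weights = [math.exp(z - m) for z in logits]
    total = sum(weights)
    return [w / total for w in weights]


x = [0.7, 0.2, 0.1]
print(mcllo_recalibrate(x, delta=[1.0, 1.0], gamma=[1.0, 1.0]))  # leaves x unchanged up to rounding
print(mcllo_recalibrate(x, delta=[1.0, 1.0], gamma=[0.5, 0.5]))  # pulls x toward uniform
```

In this sketch, `delta = 1` and `gamma = 1` for every class correspond to leaving the predictions as they are, which is the natural null case for a test of calibration. Values of `gamma` below 1 temper overconfident predictions, and `delta` shifts probability toward or away from a class relative to the baseline.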

Theorems & Definitions (8)

Lemma 1
proof
Theorem 1
proof
Lemma 2
proof
Theorem 2
proof
