Confidence Calibration of Classifiers with Many Classes

Adrien LeCoz, Stéphane Herbin, Faouzi Adjed

TL;DR

The problem of calibrating a multiclass classifier is transformed into that of calibrating a single surrogate binary classifier, which allows for more efficient use of standard calibration methods.

Abstract

For classification models based on neural networks, the maximum predicted class probability is often used as a confidence score. This score is rarely a good estimate of the probability that the prediction is correct, so a post-processing calibration step is required. However, many confidence calibration methods fail for problems with many classes. To address this issue, we transform the problem of calibrating a multiclass classifier into calibrating a single surrogate binary classifier. This approach allows for more efficient use of standard calibration methods. We evaluate our approach on numerous neural networks used for image or text classification and show that it significantly enhances existing calibration methods.
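The abstract does not spell out how the surrogate binary problem is built, so the following is only a minimal sketch of one plausible reading, not the authors' implementation. It assumes the binary question is "is the top-ranked class the correct one?", uses the maximum softmax probability as the binary score, and fits a single temperature by grid search on the binary negative log-likelihood. All function names, the grid, and the toy data are hypothetical.

```python
import math

def softmax(logits, temperature=1.0):
    """Numerically stable softmax of one logit vector, divided by a temperature."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def surrogate_binary_set(logits_batch, labels, temperature=1.0):
    """Assumed reformulation: one (confidence, correct) pair per input.

    confidence = maximum class probability, correct = 1 if the top class
    matches the label, else 0.
    """
    pairs = []
    for logits, label in zip(logits_batch, labels):
        probs = softmax(logits, temperature)
        top = max(range(len(probs)), key=probs.__getitem__)
        pairs.append((probs[top], 1 if top == label else 0))
    return pairs

def binary_nll(pairs, eps=1e-12):
    """Mean binary cross-entropy of the confidence against correctness."""
    total = 0.0
    for conf, correct in pairs:
        conf = min(max(conf, eps), 1.0 - eps)
        total -= correct * math.log(conf) + (1 - correct) * math.log(1.0 - conf)
    return total / len(pairs)

def fit_temperature(logits_batch, labels, grid=None):
    """Pick the temperature on a grid (hypothetical choice) minimizing the binary NLL."""
    grid = grid or [0.25 * k for k in range(1, 41)]  # 0.25, 0.5, ..., 10.0
    return min(
        grid,
        key=lambda t: binary_nll(surrogate_binary_set(logits_batch, labels, t)),
    )

def expected_calibration_error(pairs, n_bins=10):
    """ECE with equal-width bins: weighted mean of |bin accuracy - bin confidence|."""
    bins = [[] for _ in range(n_bins)]
    for conf, correct in pairs:
        bins[min(int(conf * n_bins), n_bins - 1)].append((conf, correct))
    ece = 0.0
    for b in bins:
        if b:
            avg_conf = sum(c for c, _ in b) / len(b)
            accuracy = sum(y for _, y in b) / len(b)
            ece += len(b) / len(pairs) * abs(accuracy - avg_conf)
    return ece

# Toy, made-up data: an overconfident 4-class model that is right half the time.
toy_logits = [[6.0, 0.0, 0.0, 0.0]] * 4
toy_labels = [0, 0, 1, 2]

before = surrogate_binary_set(toy_logits, toy_labels)
t_star = fit_temperature(toy_logits, toy_labels)
after = surrogate_binary_set(toy_logits, toy_labels, t_star)
print("fitted temperature:", t_star)
print("ECE before:", expected_calibration_error(before))
print("ECE after:", expected_calibration_error(after))
```

In this sketch the calibrator sees only one binary example per input regardless of the number of classes, which is the property the abstract credits with making standard calibration methods more efficient.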

Paper Structure

This paper contains 43 sections, 11 equations, 5 figures, 12 tables, 3 algorithms.

Figures (5)

  • Figure 1: Reliability diagrams for ResNet-50 and ViT-B/16 when using Temperature Scaling (TS), Vector Scaling (VS), and Histogram Binning (HB) on ImageNet. The subscript TvA signifies that the TvA reformulation was used, and reg means our regularization (\ref{eq:reg}) was applied. Red bars show the differences between bin accuracy (blue bar) and accuracy for perfect calibration (dashed red line). As the methods improve the calibration, these differences are reduced and the average confidence (vertical dotted line) gets closer to the global accuracy (vertical dashed line).
  • Figure 2: Test ECE evolution during training with ResNet-50 on ImageNet. The combination of regularization and TvA prevents overfitting of Vector Scaling. Temperature Scaling with TvA is shown for reference.
  • Figure 3: Influence of the calibration set size for ResNet-101 on ImageNet. Binary methods at the top and scaling methods at the bottom.
  • Figure 4: Histogram of class probabilities for 3 random classes, for ViT-B/16 on ImageNet.
  • Figure : Standard approach