A Benchmark Study on Calibration

Linwei Tao; Younan Zhu; Haolan Guo; Minjing Dong; Chang Xu

TL;DR

This work is the first large-scale investigation of calibration properties and the first study of calibration within Neural Architecture Search (NAS).

Abstract

Deep neural networks are increasingly used in a wide range of machine learning tasks. However, as these models grow in complexity, they often suffer from calibration issues despite improved prediction accuracy. Many studies have sought to improve calibration through specific loss functions, data preprocessing and training frameworks, yet the calibration properties themselves have received comparatively little attention. Our study leverages the Neural Architecture Search (NAS) search space, which offers an exhaustive model architecture space for a thorough exploration of calibration properties. Specifically, we create a model calibration dataset that evaluates 90 bin-based and 12 additional calibration measurements across 117,702 unique neural networks within the widely used NATS-Bench search space. Using this dataset, our analysis aims to answer several longstanding questions in the field: (i) Can model calibration be generalized across different datasets? (ii) Can robustness be used as a calibration measurement? (iii) How reliable are calibration metrics? (iv) Does a post-hoc calibration method affect all models uniformly? (v) How does calibration interact with accuracy? (vi) What is the impact of bin size on calibration measurement? (vii) Which architectural designs are beneficial for calibration? Our study also bridges an existing gap by exploring calibration within NAS, and the dataset enables further research into NAS calibration. As far as we are aware, this is the first large-scale investigation into calibration properties and the first study of calibration issues within NAS. The project page can be found at https://www.taolinwei.com/calibration-study
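To make the notion of a bin-based calibration measurement concrete, the following is a minimal sketch of the Expected Calibration Error (ECE) in its common equal-width-bin form. It is not taken from the paper's code. The function name, the default of 15 bins, and the sample values are assumptions chosen for illustration, and the dataset described above may compute its metrics differently.

```python
def expected_calibration_error(confidences, correct, n_bins=15):
    """Equal-width binned ECE (illustrative sketch, not the paper's code).

    confidences: top-label confidence of each prediction, in [0, 1].
    correct:     1 (or True) if the prediction was right, else 0 (or False).
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("need at least one prediction")

    # Running totals per bin: sample count, summed confidence, correct count.
    counts = [0] * n_bins
    conf_sums = [0.0] * n_bins
    hits = [0] * n_bins

    for conf, ok in zip(confidences, correct):
        # Bin b covers [b / n_bins, (b + 1) / n_bins); 1.0 goes to the last bin.
        b = min(int(conf * n_bins), n_bins - 1)
        counts[b] += 1
        conf_sums[b] += conf
        hits[b] += 1 if ok else 0

    # Weighted average of |accuracy - mean confidence| over non-empty bins.
    ece = 0.0
    for b in range(n_bins):
        if counts[b] == 0:
            continue
        accuracy = hits[b] / counts[b]
        mean_conf = conf_sums[b] / counts[b]
        ece += (counts[b] / n) * abs(accuracy - mean_conf)
    return ece


# Hypothetical predictions: the same outputs scored with different bin counts.
sample_conf = [0.95, 0.85, 0.75, 0.55]
sample_correct = [1, 1, 0, 1]
ece_10_bins = expected_calibration_error(sample_conf, sample_correct, n_bins=10)
ece_1_bin = expected_calibration_error(sample_conf, sample_correct, n_bins=1)
```

On these four hypothetical predictions the score changes substantially between 10 bins and a single bin, which is the kind of sensitivity that question (vi) on bin size is concerned with.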

Paper Structure (32 sections, 5 equations, 49 figures, 1 table, 1 algorithm)

Introduction
Related Works
Dataset Generation
Architectures Evaluated
Experiments and Discussion
Can model calibration be generalized across different datasets?
Can robustness be used as a calibration measurement?
How reliable are calibration metrics?
Does a post-hoc calibration method affect all models uniformly?
How does calibration interact with accuracy?
What is the impact of bin size on calibration measurement?
Which architectural designs are beneficial for calibration?
Conclusions
Experiments on Transformers
Experiments on Other Calibration Methods
...and 17 more sections

Figures (49)

Figure 1: The macro skeleton of each candidate architecture (top), the cell representations (bottom-left), and the operation candidates (bottom-right). The candidate channels for the size search space (SSS) are 8, 16, 24, 32, 40, 48, 56, and 64.
Figure 2: Kendall ranking correlation matrix of ECE for different TSS architecture subsets. The left subplot corresponds to the top 1000 architectures based on accuracy, while the right subplot represents the entire set of models.
Figure 3: Kendall ranking correlation of various metrics against ECE across different top-ranked model populations.
Figure 4: Properties of calibration metrics.
Figure 5: Scatter plots of ECE versus accuracy for models with accuracy above 90% (left) and for all TSS models (right) on CIFAR-10. The color-coded markers represent CIFAR-10-C AUC scores.
...and 44 more figures
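
Figures 2 and 3 report Kendall ranking correlations. As a rough illustration of what such a score measures, the sketch below computes the tau-a variant, which counts concordant and discordant pairs and ignores ties. The paper may use a different variant (for example tau-b), and the function name and ECE values here are hypothetical.

```python
from itertools import combinations


def kendall_tau_a(x, y):
    """Kendall tau-a between two equally long score lists (illustrative sketch)."""
    if len(x) != len(y):
        raise ValueError("x and y must have the same length")
    if len(x) < 2:
        raise ValueError("need at least two items to compare rankings")

    concordant = 0
    discordant = 0
    # Compare every pair of items: do both lists order the pair the same way?
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        sign = (x1 - x2) * (y1 - y2)
        if sign > 0:
            concordant += 1
        elif sign < 0:
            discordant += 1
        # sign == 0 is a tie in x or y; tau-a leaves it out of the numerator.

    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs


# Hypothetical ECE values for the same four architectures on two datasets.
ece_dataset_a = [0.010, 0.020, 0.030, 0.040]
ece_dataset_b = [0.015, 0.012, 0.035, 0.050]
tau = kendall_tau_a(ece_dataset_a, ece_dataset_b)
```

A value near 1 means the two datasets rank the architectures almost identically by ECE, a value near -1 means the rankings are reversed, and a value near 0 means the rankings are largely unrelated.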
