Proximity-Informed Calibration for Deep Neural Networks

Miao Xiong; Ailin Deng; Pang Wei Koh; Jiaying Wu; Shen Li; Jianqing Xu; Bryan Hooi

TL;DR

ProCal is a plug-and-play algorithm with a theoretical guarantee that adjusts sample confidence based on proximity. It is effective in addressing proximity bias and improving calibration in balanced, long-tail, and distribution-shift settings, under four metrics and across various model architectures.

Abstract

Confidence calibration is central to providing accurate and interpretable uncertainty estimates, especially under safety-critical scenarios. However, we find that existing calibration algorithms often overlook the issue of *proximity bias*, a phenomenon where models tend to be more overconfident in low proximity data (i.e., data lying in the sparse region of the data distribution) compared to high proximity samples, and thus suffer from inconsistent miscalibration across different proximity samples. We examine the problem over 504 pretrained ImageNet models and observe that: 1) Proximity bias exists across a wide variety of model architectures and sizes; 2) Transformer-based models are relatively more susceptible to proximity bias than CNN-based models; 3) Proximity bias persists even after performing popular calibration algorithms like temperature scaling; 4) Models tend to overfit more heavily on low proximity samples than on high proximity samples. Motivated by the empirical findings, we propose ProCal, a plug-and-play algorithm with a theoretical guarantee to adjust sample confidence based on proximity. To further quantify the effectiveness of calibration algorithms in mitigating proximity bias, we introduce proximity-informed expected calibration error (PIECE) with theoretical analysis. We show that ProCal is effective in addressing proximity bias and improving calibration on balanced, long-tail, and distribution-shift settings under four metrics over various model architectures. We believe our findings on proximity bias will guide the development of *fairer and better-calibrated* models, contributing to the broader pursuit of trustworthy AI. Our code is available at: https://github.com/MiaoXiong2320/ProximityBias-Calibration.

Paper Structure (55 sections, 4 theorems, 34 equations, 14 figures, 7 tables, 2 algorithms)

Introduction
Related Work
Confidence Calibration
Multicalibration
What is Proximity Bias?
Background
Proximity
Proximity Bias
Main Empirical Findings
Proximity-Informed ECE
How to Mitigate Proximity Bias?
Continuous Confidence: Density-Ratio Calibration
Discrete Confidence: Bin Mean-Shift
Theoretical Guarantee
Remark.
...and 40 more sections

Key Result

Theorem 4.2

Given any joint distribution $\pi(X,Y)$ and any classifier $f$ that outputs model confidence $\hat{P}$ for sample $X$, we have the following inequality, where equality holds only when there is no cancellation effect with respect to proximity:
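
The inequality itself is not reproduced above. Going by the theorem's title (PIECE captures cancellation effect), it plausibly states that ECE is a lower bound on PIECE. The form below is a reconstruction under that assumption; $D$ for the proximity of $X$ and $\hat{Y}$ for the predicted label are notation introduced here, not taken from the text above.

```latex
\underbrace{\mathbb{E}_{\hat{P}}\Big[\big|\,\mathbb{P}(\hat{Y}=Y \mid \hat{P}) - \hat{P}\,\big|\Big]}_{\text{ECE}}
\;\le\;
\underbrace{\mathbb{E}_{\hat{P},D}\Big[\big|\,\mathbb{P}(\hat{Y}=Y \mid \hat{P}, D) - \hat{P}\,\big|\Big]}_{\text{PIECE}}
```

Under this reading, the bound follows from the tower property together with $|\mathbb{E}[Z]| \le \mathbb{E}[|Z|]$: averaging over $D$ inside the absolute value lets overconfidence at low proximity offset underconfidence at high proximity, whereas taking the absolute value per proximity level does not.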

Figures (14)

Figure 1: Samples with lower (higher) proximity tend to be more overconfident (underconfident). Results are obtained with XCiT, an image Transformer, on the ImageNet validation set (All Samples). A sample's proximity is measured using the average distance to its nearest neighbors ($K=10$) in the validation set. We split samples into $10$ equal-size bins by proximity and show the bin with the highest proximity (High Proximity Samples) and the bin with the lowest proximity (Low Proximity Samples).
Figure 2: Proximity bias analysis on $504$ public models. Each marker represents a model, where marker sizes indicate model parameter numbers and different colors/shapes represent different architectures. The bias index is computed using \ref{eq:bias-index} ($0$ indicates no proximity bias). Left: We observe the following: 1) Models with higher accuracy tend to have a larger bias index. 2) Proximity bias exists across a wide range of model architectures. 3) Transformer variants (e.g. DEiT, XCiT, CaiT, and SwinV2) have a relatively larger bias than convolution-based networks (e.g. VGG and ResNet variants). Right: Confidence calibrated by temperature scaling (Upper Right) is similar to the original model confidence with respect to proximity bias. Our ProCal (Bottom Right) is effective in reducing proximity bias. Analysis of other existing calibration algorithms can be found in \ref{appendix:findings}.
Figure 3: Calibration errors on ImageNet across 504 timm models. Each point represents the calibration result of applying a calibration method to the model confidence. Marker colors indicate different calibration algorithms used. Among all calibration algorithms, our method consistently appears at the bottom of the plot. See \ref{appendix:additional_experiment_results}, \ref{fig:experiment-imagenet-allmodels-large-resolution} for high resolution figures.
Figure 4: The gap between the model's training and validation accuracy is larger on low proximity samples (31.67%) than on high proximity samples (0.6%). The gap widens as samples approach low proximity regions, even though the training and validation sets have overlapping proximity distributions.
Figure 5: Proximity bias analysis of the model confidence on $504$ public models. Each marker represents a model, where marker sizes indicate model parameter numbers and different colors/shapes represent different architectures. The bias index is computed using \ref{eq:bias-index} ($0$ indicates no proximity bias).
...and 9 more figures
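
The proximity measurement described in the Figure 1 caption (average distance to the $K$ nearest neighbors, then a split into equal-size bins) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `knn_proximity` and `equal_size_bins` are hypothetical, the feature space is whatever array the caller passes in, distances are computed by brute force, and mapping the mean distance to a proximity score via `exp(-mean_dist)` is an assumption chosen only so that a larger distance yields a lower proximity.

```python
import numpy as np


def knn_proximity(features, k=10):
    """Proximity score per sample from the mean distance to its k nearest neighbors."""
    features = np.asarray(features, dtype=float)
    # Brute-force pairwise Euclidean distances; adequate for small sample counts.
    diffs = features[:, None, :] - features[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    np.fill_diagonal(dists, np.inf)  # a sample is not its own neighbor
    mean_dist = np.sort(dists, axis=1)[:, :k].mean(axis=1)
    # Assumption: any decreasing map works here; exp(-d) keeps scores in (0, 1].
    return np.exp(-mean_dist)


def equal_size_bins(values, n_bins=10):
    """Assign each value to one of n_bins rank-based bins of (near) equal size."""
    values = np.asarray(values)
    order = np.argsort(values, kind="stable")
    bins = np.empty(len(values), dtype=int)
    bins[order] = np.arange(len(values)) * n_bins // len(values)
    return bins


if __name__ == "__main__":
    # Toy 1-D features: a dense cluster plus three isolated points.
    dense = np.arange(20).reshape(-1, 1) * 0.1
    sparse = np.array([[100.0], [200.0], [300.0]])
    prox = knn_proximity(np.vstack([dense, sparse]), k=2)
    print("lowest-proximity sample index:", int(np.argmin(prox)))
    print("bin sizes:", np.bincount(equal_size_bins(prox[:20], n_bins=10)))
```

Bin 0 then holds the lowest-proximity samples and bin `n_bins - 1` the highest, matching the Low and High Proximity groups compared in the caption.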

Theorems & Definitions (8)

Definition 3.1
Example 4.1
Theorem 4.2: PIECE captures cancellation effect.
Theorem 5.1: Brier Score after Bin-Mean-Shift is asymptotically bounded by Brier Score before calibration
Theorem A.1: Brier Score after Bin-Mean-Shift is asymptotically bounded by Brier Score before calibration
proof
Theorem B.1: PIECE captures cancellation effect.
proof
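
The cancellation effect named in Theorem 4.2 can be illustrated numerically. The sketch below uses simple binned estimators; `binned_ece` and `binned_piece` are hypothetical names, and the PIECE estimator here (calibration gap computed per joint confidence-and-proximity cell) is an assumed reading of the metric, since its definition is not given above.

```python
import numpy as np


def _weighted_gap(conf, correct, cells):
    """Sum over cells of (cell fraction) * |accuracy - mean confidence|."""
    total = 0.0
    for cell in np.unique(cells):
        mask = cells == cell
        total += mask.mean() * abs(correct[mask].mean() - conf[mask].mean())
    return float(total)


def _conf_bins(conf, n_bins):
    # Equal-width confidence bins on [0, 1]; conf == 1.0 goes into the last bin.
    return np.minimum((conf * n_bins).astype(int), n_bins - 1)


def binned_ece(conf, correct, n_bins=10):
    """Standard binned expected calibration error."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    return _weighted_gap(conf, correct, _conf_bins(conf, n_bins))


def binned_piece(conf, correct, prox_bin, n_bins=10):
    """Assumed PIECE estimator: gaps per (confidence bin, proximity bin) cell."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    prox_bin = np.asarray(prox_bin, dtype=int)
    # One unique id per joint cell.
    cells = _conf_bins(conf, n_bins) + n_bins * prox_bin
    return _weighted_gap(conf, correct, cells)


if __name__ == "__main__":
    conf = np.full(10, 0.8)                    # same confidence everywhere
    correct = [1, 1, 1, 0, 0, 1, 1, 1, 1, 1]  # 60% accurate, then 100% accurate
    prox_bin = [0] * 5 + [1] * 5               # low-proximity half, high-proximity half
    print("ECE:  ", binned_ece(conf, correct))
    print("PIECE:", binned_piece(conf, correct, prox_bin))
```

In this toy case the low-proximity half is overconfident by 0.2 and the high-proximity half is underconfident by 0.2, so the two errors cancel in ECE (approximately 0) while PIECE stays at approximately 0.2.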
