Calibration Attacks: A Comprehensive Study of Adversarial Attacks on Model Confidence

Stephen Obadinma; Xiaodan Zhu; Hongyu Guo

TL;DR

This paper investigates calibration attacks, a class of adversarial attacks that distort a model's confidence without flipping its predictions. It formalizes four attack forms (underconfidence, overconfidence, maximum miscalibration, and random confidence) and evaluates them in black-box and white-box settings on CNNs and Vision Transformers, showing substantial miscalibration with minimal accuracy loss. The authors also explore defenses, introducing Calibration Attack Adversarial Training (CAAT) and Compression Scaling (CS), and assess a broad set of recalibration methods using the ECE and KS metrics. They further show that the maximum miscalibration attack can push ECE up to a theoretical upper bound of $1 - q/K$, where $q$ is the classifier's accuracy and $K$ is the number of classes. Overall, calibration attacks induce severe miscalibration across architectures and datasets, while current defenses and recalibration strategies show significant limitations, underscoring the need for robust countermeasures in safety-critical deployments.

Abstract

In this work, we highlight and comprehensively study calibration attacks, a form of adversarial attack that aims to make victim models heavily miscalibrated without altering their predicted labels, thereby endangering the trustworthiness of the models and of any follow-up decisions based on their confidence. We propose four typical forms of calibration attacks: underconfidence, overconfidence, maximum miscalibration, and random confidence attacks, conducted in both black-box and white-box setups. We demonstrate that the attacks are highly effective on both convolutional and attention-based models: with a small number of queries, they seriously skew confidence without changing predictive performance. Given the potential danger, we further investigate the effectiveness of a wide range of adversarial defence and recalibration methods, including our proposed defences designed specifically to mitigate the harm of calibration attacks. The ECE and KS scores show that there are still significant limitations in handling calibration attacks. To the best of our knowledge, this is the first dedicated study that provides a comprehensive investigation of calibration-focused attacks. We hope this study helps attract more attention to these types of attacks and thereby helps limit the serious damage they could cause. To this end, this work also provides detailed analyses to understand the characteristics of the attacks. Our code is available at https://github.com/PhenetOs/CalibrationAttack
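
The abstract reports results in terms of ECE and KS scores. As a rough illustration of the second metric, the sketch below computes a Kolmogorov-Smirnov style calibration error in its commonly used form: sort samples by confidence, accumulate confidence and correctness, and take the largest gap between the two cumulative curves. This is an assumption about the general definition, not a copy of the authors' implementation, and the function name `ks_calibration_error` is hypothetical.

```python
import numpy as np

def ks_calibration_error(confidences, correct):
    """KS-style calibration error (illustrative sketch, not the paper's code).

    confidences: predicted-class confidence per sample, in [0, 1]
    correct:     1 if the prediction was right, 0 otherwise
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    n = len(confidences)

    # Sort by confidence; a stable sort breaks ties by input order.
    order = np.argsort(confidences, kind="stable")

    # Cumulative confidence and cumulative accuracy, both normalized by n.
    cum_conf = np.cumsum(confidences[order]) / n
    cum_acc = np.cumsum(correct[order]) / n

    # The score is the largest absolute gap between the two curves.
    return float(np.max(np.abs(cum_conf - cum_acc)))
```

Unlike binned ECE, this form needs no bin count, but its value can depend on how ties in confidence are ordered.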

Paper Structure (44 sections, 1 theorem, 7 equations, 7 figures, 17 tables, 1 algorithm)

Introduction
Related Work
Calibration of Machine Learning Models.
Adversarial Attacks and Training.
Attacking Uncertainty Estimates.
Calibration Attacks
Objective of Calibration Attacks
Four Forms of Calibration Attacks
Defence Against Calibration Attacks
Discussion on the Importance of Remaining Well Calibrated Under the Attacks
Experiments
Experimental Setup
Overall Performance
Detection Difficulty Analysis
Insights on Key Aspects of Attacks
...and 29 more sections

Key Result

Proposition 3.1

Assume $q$ is the accuracy of a $K$-way classifier $\mathcal{F}$ on the dataset $\mathcal{D} = \{\langle\mathbf{x}_n,y_n\rangle\}^N_{n=1}$. The Maximum Miscalibration Attack (MMA) maximizes the expected calibration error (ECE). The upper bound on the ECE that MMA can achieve is $1-q/K$.
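
The bound can be sanity-checked numerically. The sketch below is illustrative and not taken from the authors' code. It assumes the idealized extreme in which every correctly classified sample approaches the lowest confidence compatible with keeping its label ($1/K$) and every misclassified sample reaches confidence 1, so that ECE becomes $q(1 - 1/K) + (1 - q) = 1 - q/K$. The function names `expected_calibration_error` and `idealized_mma_ece`, the sample count, and the 15 equal-width bins are assumptions chosen for illustration.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins=15):
    """Binned ECE: bin-weighted mean of |accuracy - average confidence|."""
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)

    # Map each confidence to an equal-width bin index in 0..n_bins-1,
    # using right-closed bins (lo, hi].
    bin_ids = np.digitize(confidences, edges[1:-1], right=True)

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap  # mask.mean() is the bin's sample share
    return ece

def idealized_mma_ece(q, K, n=10_000, n_bins=15):
    """ECE in the assumed extreme case: correct -> 1/K, incorrect -> 1."""
    n_correct = int(round(q * n))
    correct = np.zeros(n)
    correct[:n_correct] = 1.0
    confidences = np.where(correct == 1.0, 1.0 / K, 1.0)
    return expected_calibration_error(confidences, correct, n_bins)
```

For example, `idealized_mma_ece(0.8, 10)` evaluates to 0.92 up to floating-point error, matching $1 - 0.8/10$.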

Figures (7)

Figure 1: Reliability diagrams of a ResNet-50 classifier (fine-tuned and tested on Caltech-101) before and after the four forms of calibration attacks. Red bars show the average accuracy on the test data binned by confidence score (15 bins), and blue bars show the average confidence of the samples in each bin. The x-axis represents the bins and the y-axis is the accuracy (for red bars) or confidence (for blue bars). The yellow line represents perfect calibration. For the minimum possible ECE, the red and blue bars must overlap completely in each bin (overlap shown in maroon); any gap between them indicates miscalibration. Despite the accuracy being unchanged, the miscalibration is severe after the attacks.
Figure 2: The influence of perturbation noise levels $\epsilon$ (the left three sub-figures) and attack iterations (the right sub-figure). Sub-figure 1 (leftmost): comparison of the ECE scores of the different calibration attacks at different $\epsilon$ values using ResNet-50 models trained on CIFAR-100. Sub-figure 2: ECE vs. $\epsilon$ using maximum miscalibration attacks on ViT models trained on CIFAR-100. Sub-figure 3: ECE vs. $\epsilon$ using maximum miscalibration attacks on ResNet-50 models trained on Caltech-101 and GTSRB. Sub-figure 4: effect of the number of attack iterations on the effectiveness of the attack algorithm. The first three sub-figures are created at the $1000^{th}$ attack iteration.
Figure 3: GradCAM visualizations showing the image regions most responsible for the decisions of ResNet-50 before (top row) and after (bottom row) the attacks. The left three images are under UCA and the right three are under OCA.
Figure 4: Comparison of the effects on accuracy and ECE of the original Square Attack algorithm and the maximum miscalibration variation of the calibration attack algorithm at 1000 attack iterations. (Top) ResNet-50 results. (Bottom) ViT results.
Figure 5: t-SNE visualization of the effect of different forms of calibration attacks on a ResNet model trained and tested on a binary subset of CIFAR-100, showing results on the test set (200 data points). From top left to bottom right: the pre-attack (vanilla) model, followed by the UCA, OCA, RCA, and MMA variations.
...and 2 more figures

Theorems & Definitions (2)

Proposition 3.1
proof
