Extreme Miscalibration and the Illusion of Adversarial Robustness

Vyas Raina; Samson Tan; Volkan Cevher; Aditya Rawal; Sheng Zha; George Karypis

TL;DR

The NLP community is urged to incorporate test-time temperature scaling into their robustness evaluations to ensure that any observed gains are genuine, and it is shown how the temperature can be scaled during training to improve genuine robustness.

Abstract

Deep learning-based Natural Language Processing (NLP) models are vulnerable to adversarial attacks, where small perturbations can cause a model to misclassify. Adversarial Training (AT) is often used to increase model robustness. However, we have discovered an intriguing phenomenon: deliberately or accidentally miscalibrating models masks gradients in a way that interferes with adversarial attack search methods, giving rise to an apparent increase in robustness. We show that this observed gain in robustness is an illusion of robustness (IOR), and demonstrate how an adversary can perform various forms of test-time temperature calibration to nullify the aforementioned interference and allow the adversarial attack to find adversarial examples. Hence, we urge the NLP community to incorporate test-time temperature scaling into their robustness evaluations to ensure that any observed gains are genuine. Finally, we show how the temperature can be scaled during training to improve genuine robustness.
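The masking effect described in the abstract can be illustrated with a small numerical sketch. It is not the authors' code: the logits, the perturbation, and the temperature value are hypothetical, and the attack is assumed to rank candidate perturbations by the change in predicted-class confidence. With extremely large logits the softmax saturates, so the attack sees no change at all; dividing the logits by a test-time temperature restores the signal without changing the predicted class.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax of logits divided by a temperature (T > 1 softens the output)."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical logits from an extremely overconfident binary classifier.
logits = [60.0, -60.0]
# The same input after a small, hypothetical adversarial perturbation.
perturbed = [55.0, -55.0]

# Without calibration both confidences saturate at 1.0 in floating point,
# so a confidence-guided search has nothing to follow.
raw_drop = softmax(logits)[0] - softmax(perturbed)[0]

# Test-time temperature scaling (illustrative value, not from the paper).
T = 50.0
calibrated_drop = softmax(logits, T)[0] - softmax(perturbed, T)[0]

# Scaling by a positive temperature never changes the predicted class.
same_prediction = (
    softmax(logits).index(max(softmax(logits)))
    == softmax(logits, T).index(max(softmax(logits, T)))
)

print(raw_drop, calibrated_drop, same_prediction)
```

Because the temperature only rescales the logits, the model's predictions and clean accuracy are unaffected; only the confidence values seen by the attack change.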

Paper Structure (63 sections, 16 equations, 4 figures, 37 tables)

Introduction
Background and Related Work
Adversarial Attacks
Adversarial Training
Model Calibration
Obfuscated Gradients
The Illusion of Robustness
Explicit: Test-time Temperature Scaling
Implicit Overconfidence: Grad. Norm.
Experiments
Data.
Models.
Adversarial attacks.
Explicit temperature scaling.
AT approaches.
...and 48 more sections

Figures (4)

Figure 1: Accuracy on adversarial examples from an out-of-the-box adversarial attack for models with different average predicted class confidence, $E_{p(\mathbf x)} [P_{\hat{\theta}}(\hat{c}|\mathbf x)]$. Extremely overconfident and underconfident models show increased robustness. We reveal that this increased robustness is merely an illusion of robustness.
Figure 2: Change in post-calibration accuracy on Rotten Tomatoes as training temperature varies. We observe that a higher temperature during training increases robustness against unseen attacks (bae, tf, pwws, dg here). The change in adversarial accuracy relative to the baseline ($T=1$) demonstrates the increase in robustness.
Figure 3: Probability density (histogram) of the predicted class logits' range (largest logit minus smallest logit) on the rt test set, with and without a high training temperature, for the baseline ST DeBERTa model. The higher temperature training setting ($T=100$) has a larger class logits' range, suggesting that an adversarial attack has to make a greater change in the logit space to be successful in changing the predicted class.
Figure 4: The use of a training temperature, $T$, is a simple adjustment in standard model training (ST), where the temperature parameter, $T$, is used to scale down predicted model logits. Higher training temperatures enhance model robustness against unseen adversarial attacks (bae, tf, pwws, dg) without requiring prior knowledge of these attack forms during training. This increased robustness is quantified by the absolute change in adversarial accuracy compared to the baseline $T=1$ ST model.
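The training-temperature adjustment described in the captions of Figures 3 and 4 can be sketched as a cross-entropy loss computed on logits divided by $T$. This is a minimal illustration rather than the authors' training code: the function names and the target loss value are hypothetical. For a two-class problem with logit gap $g$, the loss is $\log(1 + e^{-g/T})$, so reaching a fixed loss requires a gap that grows linearly with $T$, which is consistent with the larger logit range reported for $T=100$.

```python
import math

def cross_entropy_with_temperature(logits, label, temperature=1.0):
    """Cross-entropy of softmax(logits / temperature) against the true label."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max for numerical stability
    log_norm = m + math.log(sum(math.exp(s - m) for s in scaled))
    return log_norm - scaled[label]

def gap_for_loss(target_loss, temperature):
    """Binary logit gap g solving log(1 + exp(-g / T)) = target_loss."""
    return -temperature * math.log(math.exp(target_loss) - 1.0)

target = 0.1  # hypothetical training loss, chosen only for illustration
gap_t1 = gap_for_loss(target, temperature=1.0)
gap_t100 = gap_for_loss(target, temperature=100.0)

# Check: plugging each gap back in recovers the same loss at its temperature.
loss_t1 = cross_entropy_with_temperature([gap_t1, 0.0], 0, temperature=1.0)
loss_t100 = cross_entropy_with_temperature([gap_t100, 0.0], 0, temperature=100.0)

print(gap_t1, gap_t100, loss_t1, loss_t100)
```

Under these assumptions, a model trained at a higher temperature must separate its class logits by a proportionally larger margin to achieve the same loss, so a perturbation has to move the logits further to flip the prediction.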
