Adversarial Training and Robustness for Multiple Perturbations

Florian Tramèr; Dan Boneh

TL;DR

This paper investigates the possibility and limits of achieving simultaneous robustness to multiple perturbation types in classifiers. By formulating adversarial risk across perturbation sets and proving fundamental trade-offs in a simple statistical model, it shows that robustness to certain pairs of perturbation types (e.g., $\ell_\infty$ vs. $\ell_1$) is inherently mutually exclusive, and that affine combinations of perturbations can be stronger than their union in some cases. To probe these limits empirically, the authors propose multi-perturbation adversarial training schemes (Avg and Max) and introduce the Sparse $\ell_1$ Descent Attack (SLIDE). Evaluations on MNIST and CIFAR10 reveal partial gains but notable gaps compared to single-perturbation robustness, as well as practical scalability concerns. The work also documents gradient-masking phenomena and analyzes affine perturbations, suggesting that progress on multi-perturbation robustness will require new approaches, such as evaluation with gradient-free attacks or certified defenses. Overall, the study clarifies the boundaries of multi-perturbation robustness and points toward more rigorous evaluation and defense design.

Abstract

Defenses against adversarial examples, such as adversarial training, are typically tailored to a single perturbation type (e.g., small $\ell_\infty$-noise). For other perturbations, these defenses offer no guarantees and, at times, even increase the model's vulnerability. Our aim is to understand the reasons underlying this robustness trade-off, and to train models that are simultaneously robust to multiple perturbation types. We prove that a trade-off in robustness to different types of $\ell_p$-bounded and spatial perturbations must exist in a natural and simple statistical setting. We corroborate our formal analysis by demonstrating similar robustness trade-offs on MNIST and CIFAR10. Building upon new multi-perturbation adversarial training schemes, and a novel efficient attack for finding $\ell_1$-bounded adversarial examples, we show that no model trained against multiple attacks achieves robustness competitive with that of models trained on each attack individually. In particular, we uncover a pernicious gradient-masking phenomenon on MNIST, which causes adversarial training with first-order $\ell_\infty, \ell_1$ and $\ell_2$ adversaries to achieve merely $50\%$ accuracy. Our results question the viability and computational scalability of extending adversarial robustness, and adversarial training, to multiple perturbation types.
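The Avg and Max training schemes named in the TL;DR differ only in how the losses from several attacks are combined for each training example. The sketch below illustrates that difference under stated assumptions and is not the authors' code: the function names and the toy loss values are hypothetical, and in real training each row would come from running an attack (for instance an $\ell_\infty$ or $\ell_1$ attack) against the current model, which is not shown here.

```python
import numpy as np


def max_strategy_loss(per_attack_losses):
    """"Max" aggregation (hypothetical helper): for each example keep the
    loss of the strongest attack, then average over the batch."""
    losses = np.asarray(per_attack_losses, dtype=float)  # shape: (n_attacks, batch)
    return float(losses.max(axis=0).mean())


def avg_strategy_loss(per_attack_losses):
    """"Avg" aggregation (hypothetical helper): average the loss over all
    attacks and all examples in the batch."""
    losses = np.asarray(per_attack_losses, dtype=float)  # shape: (n_attacks, batch)
    return float(losses.mean())


# Made-up per-example losses for illustration only.
# Rows are attacks, columns are examples in the batch.
example_losses = [
    [0.2, 1.5, 0.7],  # assumed losses under attack 1
    [0.9, 0.3, 0.7],  # assumed losses under attack 2
]

max_loss = max_strategy_loss(example_losses)
avg_loss = avg_strategy_loss(example_losses)
print(f"max-strategy loss: {max_loss:.4f}")
print(f"avg-strategy loss: {avg_loss:.4f}")
```

Because a per-example maximum is never smaller than the corresponding mean, the Max objective upper-bounds the Avg objective on any batch. Both require running every attack on every example, so the cost per training step grows with the number of perturbation types, which is in line with the scalability concern raised in the abstract.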
