Adversarial Training and Robustness for Multiple Perturbations
Florian Tramèr, Dan Boneh
TL;DR
This paper investigates the possibility and limits of achieving simultaneous robustness to multiple perturbation types in classifiers. By formulating adversarial risk across perturbation sets and proving fundamental trade-offs in a simple statistical model, it shows that robustness to certain perturbations (e.g., $\ell_\infty$ vs. $L_1$) are inherently mutually exclusive, and that affine combinations of perturbations can be stronger than unions in some cases. To address these limits, the authors propose multi-perturbation adversarial training schemes (Avg and Max) and introduce the Sparse $L_1$ Descent Attack (SLIDE), along with empirical evaluations on MNIST and CIFAR10 that reveal partial gains but notable gaps compared to single-perturbation robustness and practical scalability concerns. The work also documents gradient-masking phenomena and analyzes affine perturbations, suggesting that robust multi-perturbation defenses will require fundamentally new approaches, including gradient-free attacks or certified defenses. Overall, the study clarifies the boundaries of multi-perturbation robustness and provides a path forward for more rigorous evaluation and defense design.
Abstract
Defenses against adversarial examples, such as adversarial training, are typically tailored to a single perturbation type (e.g., small $\ell_\infty$-noise). For other perturbations, these defenses offer no guarantees and, at times, even increase the model's vulnerability. Our aim is to understand the reasons underlying this robustness trade-off, and to train models that are simultaneously robust to multiple perturbation types. We prove that a trade-off in robustness to different types of $\ell_p$-bounded and spatial perturbations must exist in a natural and simple statistical setting. We corroborate our formal analysis by demonstrating similar robustness trade-offs on MNIST and CIFAR10. Building upon new multi-perturbation adversarial training schemes, and a novel efficient attack for finding $\ell_1$-bounded adversarial examples, we show that no model trained against multiple attacks achieves robustness competitive with that of models trained on each attack individually. In particular, we uncover a pernicious gradient-masking phenomenon on MNIST, which causes adversarial training with first-order $\ell_\infty, \ell_1$ and $\ell_2$ adversaries to achieve merely $50\%$ accuracy. Our results question the viability and computational scalability of extending adversarial robustness, and adversarial training, to multiple perturbation types.
