Adversarial Training and Robustness for Multiple Perturbations

Florian Tramèr, Dan Boneh

TL;DR

This paper investigates whether classifiers can be simultaneously robust to multiple perturbation types, and where the limits of such robustness lie. By formulating adversarial risk across perturbation sets and proving trade-offs in a simple statistical model, it shows that robustness to certain pairs of perturbations (e.g., $\ell_\infty$ vs. $\ell_1$) is mutually exclusive in that model, and that affine combinations of perturbations can in some cases be stronger than their union. The authors propose multi-perturbation adversarial training schemes (Avg and Max) and introduce the Sparse $\ell_1$ Descent Attack (SLIDE). Empirical evaluations on MNIST and CIFAR10 reveal partial gains but notable gaps compared to single-perturbation robustness, as well as practical scalability concerns. The work also documents gradient-masking phenomena and analyzes affine perturbations, suggesting that robust multi-perturbation defenses will require new approaches, such as evaluation with gradient-free attacks or certified defenses. Overall, the study clarifies the boundaries of multi-perturbation robustness and points toward more rigorous evaluation and defense design.

Abstract

Defenses against adversarial examples, such as adversarial training, are typically tailored to a single perturbation type (e.g., small $\ell_\infty$-noise). For other perturbations, these defenses offer no guarantees and, at times, even increase the model's vulnerability. Our aim is to understand the reasons underlying this robustness trade-off, and to train models that are simultaneously robust to multiple perturbation types. We prove that a trade-off in robustness to different types of $\ell_p$-bounded and spatial perturbations must exist in a natural and simple statistical setting. We corroborate our formal analysis by demonstrating similar robustness trade-offs on MNIST and CIFAR10. Building upon new multi-perturbation adversarial training schemes, and a novel efficient attack for finding $\ell_1$-bounded adversarial examples, we show that no model trained against multiple attacks achieves robustness competitive with that of models trained on each attack individually. In particular, we uncover a pernicious gradient-masking phenomenon on MNIST, which causes adversarial training with first-order $\ell_\infty, \ell_1$ and $\ell_2$ adversaries to achieve merely $50\%$ accuracy. Our results question the viability and computational scalability of extending adversarial robustness, and adversarial training, to multiple perturbation types.

Paper Structure

This paper contains 30 sections, 6 theorems, 35 equations, 5 figures, 5 tables, 1 algorithm.

Key Result

Theorem 1

Let $f$ be a classifier for $\mathcal{D}$. Let $S_{\infty}$ be the set of $\ell_\infty$-bounded perturbations with $\epsilon = 2\eta$, and $S_{1}$ the set of $\ell_1$-bounded perturbations with $\epsilon=2$. Then, $\mathcal{R}_{\text{adv}}^{\text{avg}}(f; S_{\infty}, S_{1}) \geq 1/2 \;.$
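To make the quantity in Theorem 1 concrete, here is a minimal sketch of how an average-case and a worst-case adversarial risk could be computed from per-example losses. It assumes (this page does not spell the definitions out) that $\mathcal{R}_{\text{adv}}^{\text{avg}}$ averages, over the perturbation sets, the worst-case loss within each set, and that the max variant takes the worst set for each input. The function name `adversarial_risks` and the loss table are hypothetical and purely illustrative.

```python
from typing import Sequence, Tuple


def adversarial_risks(worst_case_losses: Sequence[Sequence[float]]) -> Tuple[float, float]:
    """Return (avg_risk, max_risk) from per-example, per-perturbation-type losses.

    worst_case_losses[j][i] is assumed to be the worst-case 0-1 loss of the
    classifier on example j under perturbation set S_i
    (1.0 = some perturbation in S_i fools the model, 0.0 = robust).
    """
    n_examples = len(worst_case_losses)
    avg_total = 0.0
    max_total = 0.0
    for per_type in worst_case_losses:
        avg_total += sum(per_type) / len(per_type)  # average over perturbation types
        max_total += max(per_type)                  # worst perturbation type for this input
    return avg_total / n_examples, max_total / n_examples


# Hypothetical losses for 4 examples against (S_inf, S_1):
# every example is fooled by exactly one of the two perturbation types.
losses = [
    [1.0, 0.0],
    [0.0, 1.0],
    [1.0, 0.0],
    [0.0, 1.0],
]
avg_risk, max_risk = adversarial_risks(losses)
print(avg_risk, max_risk)  # prints: 0.5 1.0
```

In this toy table the average-case risk sits exactly at the $1/2$ lower bound of Theorem 1, while the worst-case risk over the union is $1$.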

Figures (5)

  • Figure 1: Robustness trade-off on MNIST (top) and CIFAR10 (bottom). For a union of $\ell_p$-balls (left), or of $\ell_\infty$-noise and rotation-translations (RT) (right), we train models Adv$_\text{max}$ on the strongest perturbation-type for each input. We report the test accuracy of Adv$_\text{max}$ against each individual perturbation type (solid line) and against their union (dotted brown line). The vertical lines show the adversarial accuracy of models trained and evaluated on a single perturbation type.
  • Figure 2: Performance of the Sparse $\ell_1$-Descent Attack on MNIST (left) and CIFAR10 (right) for different choices of descent directions. We run the attack for up to $1{,}000$ steps and plot the evolution of the cross-entropy loss, for an undefended model. We vary the sparsity of the gradient updates (controlled by the parameter $q$), and compare to the standard PGD attack that uses the steepest descent vector, as well as the Frank-Wolfe $\ell_1$-attack of Kang et al. (2019). For appropriate $q$, our attack vastly outperforms PGD and Frank-Wolfe.
  • Figure 3: Gradient masking in an $\ell_\infty$-adversarially trained model on MNIST, evaluated against $\ell_1$-attacks (left) and $\ell_2$-attacks (right). The model is trained against an $\ell_\infty$-PGD adversary with $\epsilon=0.3$. For a randomly chosen data point $\mathbold{x}$, we compute an adversarial perturbation $\mathbold{r}_{\text{PGD}}$ using PGD and $\mathbold{r}_{\text{GF}}$ using a gradient-free attack. The left plot is for $\ell_1$-attacks with $\epsilon=10$ and the right plot is for $\ell_2$-attacks with $\epsilon=2$. The plots display the loss on points of the form $\hat{\mathbold{x}} \coloneqq \mathbold{x} + \alpha \cdot \mathbold{r}_{\text{PGD}} + \beta \cdot \mathbold{r}_{\text{GF}}$, for $\alpha, \beta \in [0, \epsilon]$. The loss surface behaves like a step-function, and gradient-free attacks succeed in finding adversarial examples where first-order methods failed.
  • Figure 4: Adversarial examples for $\ell_\infty$, $\ell_1$ and rotation-translation (RT) attacks, and affine combinations thereof. The first column in each subplot shows clean images. The following five images in each row linearly interpolate between two attack types, as described in the paper's section on affine combinations of perturbations. Images marked in red are misclassified by a model trained against both types of perturbations. Note that there are examples for which combining a rotation-translation and $\ell_\infty$-attack is stronger than either perturbation type individually.
  • Figure: The Sparse $\ell_1$ Descent Attack (SLIDE). $P_q(|\mathbold{g}|)$ denotes the $q$th percentile of $|\mathbold{g}|$ and $\Pi_{S_{1}^{\epsilon}}$ is the projection onto the $\ell_1$-ball (see Duchi et al., 2008).
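The SLIDE caption above names three ingredients: a percentile threshold $P_q(|\mathbold{g}|)$ on the gradient magnitudes, a sparse signed update, and a projection $\Pi_{S_{1}^{\epsilon}}$ onto the $\ell_1$-ball. The following dependency-free sketch shows how these pieces might fit together. It is an illustration under stated assumptions, not the authors' implementation: the nearest-rank percentile rule, the normalization of the update by its $\ell_1$ norm, the absence of pixel-range clipping, and all function and parameter names (`slide`, `grad_fn`, `step_size`) are choices made here.

```python
import math
from typing import Callable, List, Sequence


def percentile(values: Sequence[float], q: float) -> float:
    """q-th percentile by the nearest-rank rule (one convention among several)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(q / 100.0 * len(ordered)))
    return ordered[rank - 1]


def project_l1_ball(v: Sequence[float], eps: float) -> List[float]:
    """Euclidean projection onto {r : ||r||_1 <= eps} by sorting and soft-thresholding."""
    if sum(abs(a) for a in v) <= eps:
        return list(v)
    theta, cumsum = 0.0, 0.0
    for j, u in enumerate(sorted((abs(a) for a in v), reverse=True), start=1):
        cumsum += u
        candidate = (cumsum - eps) / j
        if u - candidate > 0:
            theta = candidate  # keep the threshold from the last valid index
    return [math.copysign(max(abs(a) - theta, 0.0), a) for a in v]


def slide(grad_fn: Callable[[List[float]], List[float]], x: Sequence[float],
          eps: float, q: float, step_size: float, steps: int) -> List[float]:
    """Return a perturbation r with ||r||_1 <= eps that ascends the loss behind grad_fn."""
    r = [0.0] * len(x)
    for _ in range(steps):
        g = grad_fn([xi + ri for xi, ri in zip(x, r)])
        threshold = percentile([abs(gi) for gi in g], q)
        # Sparse direction: sign of the gradient on coordinates at or above the percentile.
        e = [float((gi > 0) - (gi < 0)) if abs(gi) >= threshold else 0.0 for gi in g]
        e_norm = sum(abs(ei) for ei in e)
        if e_norm == 0:
            break  # zero gradient: no ascent direction
        r = [ri + step_size * ei / e_norm for ri, ei in zip(r, e)]
        r = project_l1_ball(r, eps)
    return r


# Toy usage with a hypothetical linear loss L(z) = w . z, whose gradient is w everywhere.
w = [3.0, -1.0, 0.5, -2.0]
r = slide(lambda z: w, x=[0.0] * 4, eps=1.0, q=50, step_size=0.5, steps=5)
print(sum(abs(ri) for ri in r))
```

A larger `q` updates fewer coordinates per step. The caption of Figure 2 reports that this sparsity level strongly affects how well the attack performs.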

Theorems & Definitions (16)

  • Theorem 1
  • Theorem 2
  • Claim 3
  • Theorem 4
  • Theorem 5: Berry-Esseen (Berry, 1941)
  • Lemma 6
  • proof
  • Claim 7
  • proof
  • Claim 8
  • ...and 6 more