Instance adaptive adversarial training: Improved accuracy tradeoffs in neural nets

Yogesh Balaji, Tom Goldstein, Judy Hoffman

TL;DR

This work introduces Instance Adaptive Adversarial Training (IAAT), which replaces a uniform adversarial radius with per-sample radii $\epsilon_i$, enforcing robustness within $\|\delta_i\|_\infty \le \epsilon_i$ and updating each $\epsilon_i$ online via a simple rule. By combining a warmup phase with a per-sample margin schedule, IAAT improves clean accuracy at a given robustness level and maintains performance across a range of test perturbation sizes. Across CIFAR-10/100 and ImageNet, IAAT moves beyond the robustness-accuracy Pareto frontier of fixed-radius adversarial training and yields interpretable radii that correlate with ambiguity near decision boundaries, while improving generalization to image corruptions. This matters for deploying robust models in safety-critical settings where clean performance is essential.

Abstract

Adversarial training is by far the most successful strategy for improving robustness of neural networks to adversarial attacks. Despite its success as a defense mechanism, adversarial training fails to generalize well to the unperturbed test set. We hypothesize that this poor generalization is a consequence of adversarial training with a uniform perturbation radius around every training sample. Samples close to the decision boundary can be morphed into a different class under a small perturbation budget, and enforcing large margins around these samples produces poor decision boundaries that generalize poorly. Motivated by this hypothesis, we propose instance adaptive adversarial training, a technique that enforces sample-specific perturbation margins around every training sample. We show that, using our approach, test accuracy on unperturbed samples improves with a marginal drop in robustness. Extensive experiments on the CIFAR-10, CIFAR-100 and ImageNet datasets demonstrate the effectiveness of our proposed approach.
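To make the idea of sample-specific margins concrete, the following is a minimal sketch and not the paper's algorithm. It assumes a linear binary classifier, for which the worst-case $L_\infty$ perturbation has a closed form, and a hypothetical update rule that grows a sample's radius by a step `gamma` while the sample stays correctly classified and shrinks it otherwise. The names `worst_case_margin`, `update_radius`, `gamma`, and `eps_max` are illustrative assumptions.

```python
# Sketch only: a linear binary classifier stands in for the network so that the
# worst-case L_inf attack has a closed form. The update rule and all names here
# are illustrative assumptions, not the method described in the paper.

def worst_case_margin(w, b, x, y, eps):
    """Smallest signed margin y * (w.x + b) over all ||delta||_inf <= eps."""
    clean = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
    # For a linear model the worst perturbation lowers the margin by eps * ||w||_1.
    return clean - eps * sum(abs(wi) for wi in w)

def update_radius(w, b, x, y, eps, gamma, eps_max):
    """Return the largest of (eps + gamma, eps, eps - gamma), clipped to
    [0, eps_max], at which the sample is still classified correctly.
    If none qualifies, shrink the radius by gamma (never below 0)."""
    for candidate in (eps + gamma, eps, eps - gamma):
        candidate = min(max(candidate, 0.0), eps_max)
        if worst_case_margin(w, b, x, y, candidate) > 0:
            return candidate
    return max(eps - gamma, 0.0)

# Toy data: one sample near the decision boundary, one far from it.
w, b = [1.0, -1.0], 0.0
samples = {"near": ([0.5, 0.0], 1), "far": ([4.0, -4.0], 1)}

# Every sample starts from the same radius, then adapts independently.
radii = {name: 1.0 for name in samples}
for _ in range(20):
    for name, (x, y) in samples.items():
        radii[name] = update_radius(w, b, x, y, radii[name], gamma=0.125, eps_max=3.0)
```

In this toy setup the sample near the boundary settles at a small radius, while the sample far from the boundary grows until it reaches the cap `eps_max`. With a real network, the closed-form margin would be replaced by an iterative attack such as PGD run at each candidate radius.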

Paper Structure

This paper contains 22 sections, 3 equations, 9 figures, 9 tables, 2 algorithms.

Figures (9)

  • Figure 1: Overview of instance adaptive adversarial training. Samples close to the decision boundary (bird on the left) have nearby samples from a different class (deer) within a small $L_p$ ball, making the constraints imposed by PGD-8 / PGD-16 adversarial training infeasible. Samples far from the decision boundary (deer on the right) can withstand large perturbations well beyond $\epsilon=8$. Our adaptive adversarial training correctly assigns the perturbation radius (shown in dotted line) so that samples within each $L_p$ ball maintain the same class.
  • Figure 2: Visualizing training samples and their perturbations. The left panel shows samples that are assigned small $\epsilon$ (displayed below images) during adaptive training. These images are close to class boundaries, and change class when perturbed with $\epsilon \ge 8$. The right panel shows images that are assigned large $\epsilon$. These lie far from the decision boundary, and retain class information even with very large perturbations. All $\epsilon$ values lie in the range $[0, 255]$.
  • Figure 3: Tradeoffs between accuracy and robustness: Each blue dot denotes an adversarially trained model with a different $\epsilon$. Models trained using instance adaptive adversarial training are shown in red. Adaptive training breaks through the Pareto frontier achieved by plain adversarial training with a fixed $\epsilon$.
  • Figure 4: Plot of adversarial robustness over a sweep of test $\epsilon$
  • Figure 5: Visualizing the progress of $\epsilon$ during instance adaptive adversarial training. The plot on the left shows the average $\epsilon$ of samples over epochs, while the plot on the right shows the $\epsilon$ progress of three randomly chosen samples.
  • ...and 4 more figures
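The Pareto-frontier claim in the Figure 3 caption can be read as a dominance check: a model breaks through the frontier if no fixed-$\epsilon$ baseline is at least as good on both clean and robust accuracy. The sketch below illustrates that check. The helper names `pareto_front` and `breaks_frontier` are hypothetical, and the accuracy pairs are made-up placeholders, not results from the paper.

```python
# Sketch only: points are (clean accuracy %, robust accuracy %) pairs, higher is
# better on both axes. All numbers below are invented for illustration.

def pareto_front(points):
    """Return the points that no other point matches or beats on both axes."""
    front = []
    for p in points:
        dominated = any(
            q != p and q[0] >= p[0] and q[1] >= p[1] for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front)

def breaks_frontier(candidate, baseline_points):
    """True if no baseline is at least as good as the candidate on both axes."""
    return not any(
        q[0] >= candidate[0] and q[1] >= candidate[1] for q in baseline_points
    )

# Hypothetical fixed-epsilon baselines (one point per training epsilon).
baselines = [(95, 10), (90, 35), (87, 45), (85, 40), (80, 50)]
front = pareto_front(baselines)

# Hypothetical adaptive models: one beyond the frontier, one inside it.
beyond = breaks_frontier((91, 44), baselines)
inside = breaks_frontier((86, 42), baselines)
```

Here `(85, 40)` drops out of the frontier because `(87, 45)` is better on both axes. The first candidate is not dominated by any baseline, while the second is dominated by `(87, 45)`.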