Instance adaptive adversarial training: Improved accuracy tradeoffs in neural nets
Yogesh Balaji, Tom Goldstein, Judy Hoffman
TL;DR
This work introduces Instance Adaptive Adversarial Training (IAAT), which replaces the uniform adversarial radius with per-sample radii $\epsilon_i$: robustness is enforced only within $\|\delta_i\|_\infty \le \epsilon_i$, and each $\epsilon_i$ is updated online during training via a simple rule. By combining a warmup phase with a per-sample margin schedule, IAAT improves clean accuracy at a given robustness level and maintains performance across a range of test perturbation sizes. On CIFAR-10, CIFAR-100, and ImageNet, IAAT improves the robustness-accuracy trade-off relative to uniform-radius adversarial training, yields interpretable radii that correlate with ambiguity near decision boundaries, and improves generalization to image corruptions. The approach is relevant to deploying robust models in safety-critical settings where clean performance is essential.
Abstract
Adversarial training is by far the most successful strategy for improving the robustness of neural networks to adversarial attacks. Despite its success as a defense mechanism, adversarial training fails to generalize well to the unperturbed test set. We hypothesize that this poor generalization is a consequence of adversarial training with a uniform perturbation radius around every training sample. Samples close to the decision boundary can be morphed into a different class under a small perturbation budget, and enforcing large margins around these samples produces decision boundaries that generalize poorly. Motivated by this hypothesis, we propose instance adaptive adversarial training, a technique that enforces sample-specific perturbation margins around every training sample. We show that with our approach, test accuracy on unperturbed samples improves with only a marginal drop in robustness. Extensive experiments on the CIFAR-10, CIFAR-100, and ImageNet datasets demonstrate the effectiveness of the proposed approach.
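To make the idea of sample-specific perturbation margins concrete, the following is a minimal sketch of one possible per-sample radius update. It is not the update rule used in the paper, which is not specified here. It assumes a binary linear classifier, for which the worst-case $\ell_\infty$ perturbation has a closed form, so no iterative attack is needed. The function names `worst_case_margin` and `update_radii`, the step size `gamma`, and the grow/keep/shrink rule are all hypothetical choices made for illustration.

```python
import numpy as np

def worst_case_margin(w, b, X, y, eps):
    """Margin of a linear classifier under the worst L-inf perturbation.

    For f(x) = w.x + b and labels y in {-1, +1}, the perturbation that
    minimizes y * f(x + delta) subject to ||delta||_inf <= eps is
    delta = -y * eps * sign(w), which gives y * f(x) - eps * ||w||_1.
    """
    return y * (X @ w + b) - eps * np.abs(w).sum()

def update_radii(w, b, X, y, eps, gamma=0.1, eps_min=0.0):
    """Hypothetical per-sample radius update (an assumption for illustration).

    Grow eps_i by gamma if the sample stays correctly classified at the
    larger radius, keep it if the sample is robust only at the current
    radius, and shrink it by gamma otherwise.
    """
    robust_at_larger = worst_case_margin(w, b, X, y, eps + gamma) > 0
    robust_at_current = worst_case_margin(w, b, X, y, eps) > 0
    new_eps = np.where(
        robust_at_larger, eps + gamma,
        np.where(robust_at_current, eps, eps - gamma),
    )
    # Radii are never allowed to fall below eps_min.
    return np.maximum(new_eps, eps_min)

# Toy data: three positive samples with clean margins 2.0, 0.3, and 0.1.
w = np.array([1.0, -1.0])
b = 0.0
X = np.array([[2.0, 0.0], [0.3, 0.0], [0.1, 0.0]])
y = np.array([1.0, 1.0, 1.0])
eps = np.array([0.5, 0.1, 0.1])
new_eps = update_radii(w, b, X, y, eps)
```

In this toy example the sample far from the decision boundary has its radius grown to 0.6, the sample robust only at its current radius keeps 0.1, and the sample closest to the boundary has its radius shrunk to 0.0. This mirrors the hypothesis above: samples near the boundary receive smaller margins instead of a uniform one.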
