Adversarial Robustness Limits via Scaling-Law and Human-Alignment Studies

Brian R. Bartoldson; James Diffenderfer; Konstantinos Parasyris; Bhavya Kailkhura

TL;DR

The paper tackles the CIFAR10 adversarial robustness gap by deriving scaling laws that incorporate synthetic data quality, enabling compute-aware optimization and exposing asymptotic robustness limits. It introduces three complementary modeling approaches and demonstrates compute-efficient training that surpasses prior state-of-the-art, while revealing fundamental barriers to human-level robustness. A small-scale human study uncovers validity issues in adversarial data, suggesting that current benchmarks may underestimate human performance and mischaracterize robustness. The work argues that progress will require more efficient training algorithms, improved architectures, and redesigned attack formulations that ensure image validity, rather than mere scaling. Together, these contributions provide actionable guidance for designing robust models under practical compute budgets and for rethinking how robustness benchmarks are constructed.

Abstract

This paper revisits the simple, long-studied, yet still unsolved problem of making image classifiers robust to imperceptible perturbations. Taking CIFAR10 as an example, SOTA clean accuracy is about $100$%, but SOTA robustness to $\ell_{\infty}$-norm bounded perturbations barely exceeds $70$%. To understand this gap, we analyze how model size, dataset size, and synthetic data quality affect robustness by developing the first scaling laws for adversarial training. Our scaling laws reveal inefficiencies in prior art and provide actionable feedback to advance the field. For instance, we discovered that SOTA methods diverge notably from compute-optimal setups, using excess compute for their level of robustness. Leveraging a compute-efficient setup, we surpass the prior SOTA with $20$% ($70$%) fewer training (inference) FLOPs. We trained various compute-efficient models, with our best achieving $74$% AutoAttack accuracy ($+3$% gain). However, our scaling laws also predict robustness slowly grows then plateaus at $90$%: dwarfing our new SOTA by scaling is impractical, and perfect robustness is impossible. To better understand this predicted limit, we carry out a small-scale human evaluation on the AutoAttack data that fools our top-performing model. Concerningly, we estimate that human performance also plateaus near $90$%, which we show to be attributable to $\ell_{\infty}$-constrained attacks' generation of invalid images not consistent with their original labels. Having characterized limiting roadblocks, we outline promising paths for future research.

Paper Structure (47 sections, 8 equations, 21 figures, 4 tables)

Introduction
Related Work
Adversarial Training Trends
Adversarial Examples That Fool Humans
Scaling Laws for Adversarial Robustness
Background and Methodology
Adversarial Robustness
Adversarial Training Settings
Training Datasets
Models
Evaluation
Deriving Scaling Laws
Approach 1: Envelope of training curves
Approach 2: Parametric loss with data quality terms
Approach 3: Parametric loss modeling data quality and its effect on overfitting
...and 32 more sections

Figures (21)

Figure 1: SOTA techniques cannot solve the CIFAR10 adversarial robustness problem, even with infinite compute. (left) Scaling laws predict that performance asymptotes near 90%. We produce this estimate using Approach 2 (Section \ref{sec:approach_2}), a scaling law approach we designed that can model the effect of FID, allowing approximation of performance when training on a hypothetical FID=0.0 synthetic dataset of arbitrarily large size. Note that lowering FID raises dataset quality. We learned the parameters of Approach 2's parametric model of testing performance by fitting to the performances that resulted from training various NNs on various datasets; the depicted dots show a subset of these performances. (right) A CIFAR10 test image is adversarially perturbed by AutoAttack, causing humans to disagree with the ground-truth label "horse". Consistent with scaling law asymptotes, our small-scale human study (Section \ref{sec:human}) estimates that the original label is no longer an appropriate label for roughly $10$% of adversarial data. Benchmarking that uses the original label after it ceases to be a good fit for the image will suggest that NN robustness is further from being "solved" (human-level) than it truly is.
Figure 2: Subtle attacks transfer to humans, showing brittleness in human visual perception. Try to classify these adversarial images that fool our SOTA neural net as CIFAR10 classes (airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck). Next, see Figure \ref{fig:fool-humans} for the clean counterparts of these images to see if you were also fooled. Section \ref{sec:human} further discusses such "invalid" data.
Figure 3: Approach 1 fits to the envelope of training curves. This adversarial robustness scaling law approach does not model the effect of training dataset FID on performance, so it is only fit to results associated with one training dataset. For example, in the left plot, we show the learning curves of our models trained on PFGM++ synthetic data. When training, we vary $N$ between 6M and 316M and $D$ between 5M and 100M. Given the optimal $N,D$ combination at each FLOP level (the "envelope"), we estimate how the optimal $N$, $N^*$ (middle), and optimal $D$, $D^*$ (right), changes as a function of FLOPs via log-space regressions on points sampled from the envelope.
Figure 4: Approach 2 models the effect of synthetic dataset quality. Approach 2 can fit to training runs that used datasets with differing FIDs by modeling FID's effect on performance. (left) Equation \ref{eq:approach_2} fits runs that used different models and dataset sizes/qualities. EDM-6 and PFGM++ were not used to learn the fit's parameters, but are plotted to show extrapolation performance of the fit. We plot isoLoss contours (regions with the same loss predicted by our scaling laws; higher losses are brighter) and the curves for $N^*$ (middle) and $D^*$ (right), the training settings predicted to be compute-optimal at a FLOPs budget for a synthetic dataset with DG20's quality.
Figure 5: Approach 3 fits to models trained on data of any FID. Our third approach can account for overfitting happening sooner on lower quality datasets, allowing it to simultaneously model training runs from all of our datasets, even those with very low qualities. As a result, Approach 3's learned parameters account for more training run results than other approaches' parameters do.
...and 16 more figures
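
The Figure 3 caption describes estimating how the compute-optimal $N^*$ and $D^*$ change with FLOPs via log-space regressions on points sampled from the envelope. The sketch below illustrates that kind of fit under stated assumptions: it assumes a pure power law, optimum ≈ a · FLOPs^b, and it uses made-up envelope points rather than any values from the paper. The helper names `fit_power_law` and `predict` are hypothetical, and the paper's actual fitting procedure may differ.

```python
import numpy as np


def fit_power_law(flops, optimum):
    """Fit optimum ~= a * flops**b by ordinary least squares in log-log space.

    Returns (a, b). Assumes all inputs are strictly positive.
    """
    log_c = np.log10(np.asarray(flops, dtype=float))
    log_y = np.log10(np.asarray(optimum, dtype=float))
    # A straight line in log-log space: log_y = b * log_c + log_a.
    b, log_a = np.polyfit(log_c, log_y, deg=1)
    return 10.0 ** log_a, b


def predict(flops, a, b):
    """Evaluate the fitted power law at the given FLOPs budgets."""
    return a * np.asarray(flops, dtype=float) ** b


# Hypothetical envelope points (NOT from the paper): FLOPs budgets and the
# model size that was best at each budget, with mild multiplicative noise.
rng = np.random.default_rng(0)
flops = np.logspace(15, 18, 8)
true_a, true_b = 2.0e-2, 0.5
n_star = true_a * flops**true_b * np.exp(rng.normal(0.0, 0.05, size=flops.size))

a_hat, b_hat = fit_power_law(flops, n_star)
print(f"fitted exponent b = {b_hat:.3f}, prefactor a = {a_hat:.3e}")
print("extrapolated N* at 1e19 FLOPs:", predict(1e19, a_hat, b_hat))
```

The same function can be applied to envelope samples of $D^*$. Fitting in log space turns the power law into a linear regression, so each decade of FLOPs is weighted equally instead of the largest budgets dominating the fit.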
