Evaluating the Robustness of the "Ensemble Everything Everywhere" Defense

Jie Zhang, Christian Schlarmann, Kristina Nikolić, Nicholas Carlini, Francesco Croce, Matthias Hein, Florian Tramèr

TL;DR

The paper critically re-evaluates the Ensemble Everything Everywhere defense, showing that its claimed robustness is undermined by gradient masking caused by the defense's randomness and its ensembling method. By constructing stronger adaptive attacks, including transfer attacks from a source model with CrossMax removed and Expectation over Transformation (EoT), the authors reduce robust accuracy from approximately $0.62$ to $0.11$ on CIFAR-10 and from $0.48$ to $0.14$ on CIFAR-100 under the $\ell_\infty$ threat model with $\varepsilon=8/255$. They demonstrate that AutoAttack alone is insufficient to establish robustness, and that perceptually aligned gradients do not guarantee resilience to worst-case perturbations. The work underscores the need for rigorous, adaptive evaluations of defenses and discusses broader implications and potential legitimate uses of multi-resolution ideas beyond adversarial robustness.
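Expectation over Transformation (EoT), one of the adaptive techniques named above, averages gradients over many draws of a randomized model's randomness before each attack step. Below is a minimal sketch of that idea in plain Python. It assumes a hypothetical toy loss whose gradient is a true direction `w` plus large zero-mean noise on every call, standing in for the defense's randomness; the real defense is a neural network, and the names `randomized_loss_gradient`, `eot_gradient`, and `sign_agreement` are illustrative, not taken from the paper's code.

```python
import random


def randomized_loss_gradient(w, noise_scale, rng):
    # Hypothetical toy loss, linear in x: loss(x; z) = sum((w_i + z_i) * x_i),
    # with fresh noise z_i ~ N(0, noise_scale^2) on every forward pass.
    # Its gradient with respect to x is w + z, so x itself is not needed here.
    return [wi + rng.gauss(0.0, noise_scale) for wi in w]


def eot_gradient(w, noise_scale, n_samples, rng):
    # Expectation over Transformation: average the gradient over
    # n_samples independent draws of the randomness.
    total = [0.0] * len(w)
    for _ in range(n_samples):
        g = randomized_loss_gradient(w, noise_scale, rng)
        total = [t + gi for t, gi in zip(total, g)]
    return [t / n_samples for t in total]


def sign_agreement(g, w):
    # Fraction of coordinates where sign(g) matches sign(w); an l_inf
    # (sign-gradient) attack step only uses these signs.
    return sum((gi > 0) == (wi > 0) for gi, wi in zip(g, w)) / len(w)


if __name__ == "__main__":
    rng = random.Random(0)
    w = [rng.choice([-1.0, 1.0]) for _ in range(200)]
    single = eot_gradient(w, noise_scale=10.0, n_samples=1, rng=rng)
    averaged = eot_gradient(w, noise_scale=10.0, n_samples=1000, rng=rng)
    print("1 draw:     ", sign_agreement(single, w))
    print("1000 draws: ", sign_agreement(averaged, w))
```

With 1,000 draws the noise in the averaged gradient shrinks by a factor of about √1000 ≈ 32, so its signs almost always match `sign(w)`, while a single draw is close to a coin flip. This toy only illustrates qualitatively why enough EoT iterations matter; it does not reproduce the paper's numbers.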

Abstract

Ensemble everything everywhere is a defense against adversarial examples that was recently proposed to make image classifiers robust. This defense works by ensembling a model's intermediate representations at multiple noisy image resolutions, producing a single robust classification. It was shown to be effective against multiple state-of-the-art attacks. Perhaps even more convincingly, it was shown that the model's gradients are perceptually aligned: attacks against the model produce noise that perceptually resembles the targeted class. In this short note, we show that this defense is not robust to adversarial attacks. We first show that the defense's randomness and ensembling method cause severe gradient masking. We then use standard adaptive attack techniques to reduce the defense's robust accuracy from 48% to 14% on CIFAR-100 and from 62% to 11% on CIFAR-10, under the $\ell_\infty$-norm threat model with $\varepsilon=8/255$.
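The threat model above allows each pixel to change by at most $\varepsilon=8/255$. A minimal sketch of projected gradient descent (PGD) under that $\ell_\infty$ constraint, in plain Python on flattened pixel lists in $[0, 1]$, follows. The step size $2/255$ and the 10 steps are assumptions, not the paper's settings; the paper's evaluation uses APGD, which adapts its step size. `grad_fn` is a hypothetical callable returning the loss gradient at the current input: with EoT it would return an averaged gradient, and in a transfer attack it would query the source model instead of the target.

```python
def linf_pgd_step(x_adv, x_orig, grad, eps=8 / 255, step=2 / 255):
    # One l_inf PGD ascent step on flattened pixel values in [0, 1].
    out = []
    for xa, xo, g in zip(x_adv, x_orig, grad):
        sign = (g > 0) - (g < 0)              # sign of the gradient: -1, 0, or +1
        v = xa + step * sign                  # move along the sign of the gradient
        v = min(max(v, xo - eps), xo + eps)   # project into the eps-ball around x_orig
        v = min(max(v, 0.0), 1.0)             # keep a valid pixel intensity
        out.append(v)
    return out


def linf_pgd(x_orig, grad_fn, n_steps=10, eps=8 / 255, step=2 / 255):
    # grad_fn is a hypothetical callable returning the loss gradient at x.
    x_adv = list(x_orig)
    for _ in range(n_steps):
        x_adv = linf_pgd_step(x_adv, x_orig, grad_fn(x_adv), eps, step)
    return x_adv


if __name__ == "__main__":
    # Toy linear loss with constant gradient w.
    x = [0.5, 0.0, 1.0]
    w = [1.0, -1.0, 1.0]
    print(linf_pgd(x, lambda p: w))
```

In this toy, the first pixel moves to the edge of the ball at $0.5 + 8/255$, while the other two stay at $0$ and $1$ because of the pixel-range clip. Since each step uses only the sign of the gradient, a gradient whose signs are close to random, as under gradient masking, makes little progress; this is why the quality of the gradient estimate matters as much as the update rule.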

Paper Structure

This paper contains 12 sections, 3 figures, 1 table.

Figures (3)

  • Figure 1: If we plot a two-dimensional slice of the loss surface, the original model in (a) has extremely large spikes in the log-probabilities of the model output. This makes it very difficult for gradient-based search to identify adversarial examples. If we remove the randomness, (b) shows that the loss surface is indeed smooth, confirming that the model itself does not introduce gradient instabilities. Therefore, by ensembling over the randomness, as in (c), we can correct for the noisy loss surface and break the defense.
  • Figure 2: (a) For different attack strategies, we plot the attack's loss averaged over 100 examples: (1) directly optimizing the target model (in green) has trouble converging due to gradient masking; (2) attacking the "source" model with a mean aggregation (blue) works well; (3) transferring the attack from the source model to the target model (red) outperforms the attack that directly optimizes over the target model. (b) An ablation over the number of EoT iterations for the APGD attack (100 steps) on 32 CIFAR-10 samples. A sufficiently large number of EoT iterations is crucial; however, there is no consistent improvement beyond 100 EoT iterations.
  • Figure 3: Examples of clean images (top row) with the adversarial examples found by our attack (middle row), and the corresponding perturbation (bottom row, centered and magnified). In some cases, the perturbations are clearly interpretable, e.g., CIFAR-10 "cat" $\to$ "dog" (2nd from right) or CIFAR-100 "sea" $\to$ "apple" (5th from left).
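The loss-surface diagnostic in Figure 1 can be sketched on a toy problem. The sketch below assumes the defense's randomness can be modeled as fresh zero-mean noise added to every loss evaluation; the real defense injects noise into multi-resolution inputs of a neural network, and `smooth_loss`, `randomized_loss`, and `averaged_loss` are hypothetical stand-ins. It evaluates the loss on a 2D grid $x + a u + b v$ and measures roughness as the mean absolute difference between neighboring grid values, mirroring panels (a) to (c): a single noisy draw, no randomness, and an average over many draws.

```python
import random


def smooth_loss(x):
    # Hypothetical stand-in for the deterministic model's loss (a smooth quadratic).
    return sum(xi * xi for xi in x)


def randomized_loss(x, rng, noise_scale=1.0):
    # Hypothetical stand-in for the randomized defense: the same loss plus
    # fresh zero-mean noise on every evaluation.
    return smooth_loss(x) + rng.gauss(0.0, noise_scale)


def loss_slice(loss_fn, x, u, v, radius, n):
    # Evaluate loss_fn on the n x n grid x + a*u + b*v with a, b in [-radius, radius].
    coords = [-radius + 2 * radius * i / (n - 1) for i in range(n)]
    return [
        [loss_fn([xi + a * ui + b * vi for xi, ui, vi in zip(x, u, v)]) for b in coords]
        for a in coords
    ]


def roughness(grid):
    # Mean absolute difference between neighboring values along each grid row.
    diffs = [abs(row[j + 1] - row[j]) for row in grid for j in range(len(row) - 1)]
    return sum(diffs) / len(diffs)


def averaged_loss(rng, n_draws):
    # Ensemble over the randomness: average n_draws noisy evaluations per point.
    return lambda p: sum(randomized_loss(p, rng) for _ in range(n_draws)) / n_draws


if __name__ == "__main__":
    rng = random.Random(0)
    x, u, v = [0.5] * 4, [1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]
    r = 8 / 255
    noisy = lambda p: randomized_loss(p, rng)
    print("deterministic:", roughness(loss_slice(smooth_loss, x, u, v, r, 21)))
    print("single draw:  ", roughness(loss_slice(noisy, x, u, v, r, 21)))
    print("100 draws:    ", roughness(loss_slice(averaged_loss(rng, 100), x, u, v, r, 21)))
```

In this toy, neighboring values on the deterministic slice differ by only about 0.003, a single noisy draw by roughly 1, and the 100-draw average by roughly a tenth of that, because averaging $n$ independent draws divides the noise standard deviation by $\sqrt{n}$.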