Cutting through buggy adversarial example defenses: fixing 1 line of code breaks Sabre
Nicholas Carlini
TL;DR
The paper critically examines Sabre, an adversarial example defense that claims high robustness, and identifies systematic flaws in its evaluation: a bug that inflates robustness, followed by successive code changes that introduce gradient masking. Through a sequence of attacks, the author demonstrates that fixing a single line of code reduces robust accuracy to 0%, and that later revisions of the defense add further nondifferentiable components that corrupt the evaluation. The paper also argues that Sabre's comparisons to baselines and its evaluation against adaptive attacks are inadequate or misrepresented. The work emphasizes that robustness claims are credible only when backed by rigorous, reproducible evaluations with adaptive and gradient-free attacks, and it warns against the spread of brittle defenses built on flawed analyses.
Abstract
Sabre is a defense to adversarial examples that was accepted at IEEE S&P 2024. We first reveal significant flaws in the evaluation that point to clear signs of gradient masking. We then show the cause of this gradient masking: a bug in the original evaluation code. By fixing a single line of code in the original repository, we reduce Sabre's robust accuracy to 0%. In response to this, the authors modify the defense and introduce a new defense component not described in the original paper. But this fix contains a second bug; modifying one more line of code reduces robust accuracy to below baseline levels. After we released the first version of our paper online, the authors introduced another change to the defense; by commenting out one line of code during attack we reduce the robust accuracy to 0% again.
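To make the term "gradient masking" concrete, the following is a minimal sketch of the general phenomenon, not a reproduction of Sabre or of the bugs described above. It assumes a hypothetical toy linear classifier (`W`, `B`) placed behind a hypothetical quantizing preprocessor (`quantize`); every name and number in it is illustrative. A gradient taken through the nondifferentiable step is zero, so a one-step signed-gradient attack does nothing and the defense looks robust. Treating the preprocessor as the identity on the backward pass (a straight-through approximation, in the spirit of BPDA) recovers a useful gradient and the same attack succeeds.

```python
# Hypothetical toy model: a linear classifier behind a quantizing preprocessor.
# None of these names or values come from Sabre; they are assumptions for illustration.
W = [2.0, -3.0, 1.5]
B = 0.1


def quantize(x, levels=8):
    """Nondifferentiable preprocessing: snap each feature to a grid of step 1/levels."""
    return [round(v * levels) / levels for v in x]


def logit(x):
    """Positive logit means class +1, negative means class -1."""
    return sum(w * v for w, v in zip(W, x)) + B


def defended_logit(x):
    return logit(quantize(x))


def numeric_grad(f, x, h=1e-6):
    """Central finite differences, standing in for backprop through f."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        dn = list(x)
        up[i] += h
        dn[i] -= h
        grad.append((f(up) - f(dn)) / (2 * h))
    return grad


def sign(v):
    return (v > 0) - (v < 0)


def signed_step(x, grad, eps):
    """One signed-gradient step that tries to lower the logit of the true class (+1)."""
    return [v - eps * sign(g) for v, g in zip(x, grad)]


x = [0.30, 0.20, 0.40]  # clean input, classified as +1
eps = 0.25              # perturbation budget (L-infinity)

# Gradient through the quantizer: flat almost everywhere, so it is zero here.
masked_grad = numeric_grad(defended_logit, x)
x_masked = signed_step(x, masked_grad, eps)

# Straight-through approximation: pretend quantize() is the identity when
# differentiating, which for a linear model makes the gradient simply W.
straight_through_grad = list(W)
x_adaptive = signed_step(x, straight_through_grad, eps)

print("clean logit:          ", defended_logit(x))
print("masked-gradient attack:", defended_logit(x_masked))
print("adaptive attack:       ", defended_logit(x_adaptive))
```

In this sketch the masked-gradient attack leaves the input unchanged, while the adaptive attack flips the sign of the logit within the same budget. The general lesson matches the abstract: a robust accuracy figure obtained with an attack that cannot see useful gradients says little about the defense itself.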
