Is AmI (Attacks Meet Interpretability) Robust to Adversarial Examples?
Nicholas Carlini
TL;DR
The paper evaluates AmI, an interpretability-based defense for detecting adversarial examples, using a defense-oblivious attack: high-confidence adversarial examples are crafted against the original, undefended model and then tested against the defended model. It reports that AmI achieves a 0% true-positive rate under untargeted attacks with a small distortion bound, which indicates no robustness gain over the undefended network. The author critiques the lack of a formal threat model in the original defense publication and advocates transparent evaluation and release of code to prevent overstated claims. Overall, the work emphasizes the need for rigorous, threat-model-driven assessment of defenses, especially those that rely on interpretability signals, to avoid misleading conclusions about robustness.
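To make the idea of a defense-oblivious, high-confidence attack concrete, here is a minimal sketch under stated assumptions. It is not the paper's implementation: the toy linear softmax model, the L-infinity bound, the sign-gradient update, and all names (`high_confidence_linf_attack`, `W`, `b`, `eps`, `step`) are hypothetical choices made for illustration. The attack only sees the undefended model; a detector such as AmI would be evaluated afterwards on the resulting `x_adv`.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D logit vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def high_confidence_linf_attack(W, b, x, true_label, eps=0.05, step=0.01, iters=100):
    """Untargeted projected sign-gradient attack on a toy model (logits = W @ x + b).

    The loop keeps ascending the cross-entropy loss of the true label even
    after the prediction flips, so the result tends to be misclassified
    with high confidence while staying inside the L-infinity ball of
    radius eps around x.
    """
    x_adv = x.copy()
    onehot = np.zeros(W.shape[0])
    onehot[true_label] = 1.0
    for _ in range(iters):
        p = softmax(W @ x_adv + b)
        # Gradient of -log p[true_label] with respect to the input.
        grad = W.T @ (p - onehot)
        x_adv = x_adv + step * np.sign(grad)
        # Project back into the distortion bound and the valid input range.
        x_adv = np.clip(x_adv, x - eps, x + eps)
        x_adv = np.clip(x_adv, 0.0, 1.0)
    return x_adv


# Toy setup (assumed): two classes, 100 input features in [0, 1].
d = 100
W = np.vstack([np.ones(d), -np.ones(d)])
b = np.array([-50.0, 50.0])
x = np.full(d, 0.51)          # classified as class 0 by the toy model
true_label = 0
eps = 0.05

x_adv = high_confidence_linf_attack(W, b, x, true_label, eps=eps)
clean_pred = int(np.argmax(W @ x + b))
adv_probs = softmax(W @ x_adv + b)
adv_pred = int(np.argmax(adv_probs))
```

In a defense-oblivious evaluation, `x_adv` would then be passed unchanged to the defended model, and the detector's true-positive rate would be the fraction of such inputs it flags.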
Abstract
No.
