AutoAdvExBench: Benchmarking autonomous exploitation of adversarial example defenses
Nicholas Carlini, Javier Rando, Edoardo Debenedetti, Milad Nasr, Florian Tramèr
TL;DR
AutoAdvExBench is a proxy-free benchmark that directly measures an LLM's ability to autonomously break adversarial-example defenses, working from defense papers and their implementations to produce end-to-end adversarial attacks. The study demonstrates a substantial gap between performance on CTF-like defenses (75% attack success on a 24-defense subset) and real-world defenses (21% on 51 defenses with the strongest model), underscoring the importance of using real-world, end-to-end data for security evaluations. By designing a specialized, stepwise agent, the paper shows that decomposing the attack process into forward-pass, differentiability, FGSM, and PGD steps markedly improves success rates, yet real-world code remains far more challenging. These results highlight the need for continuous, proxy-free benchmarks that closely mirror practical security tasks and offer a scalable, interpretable measure of progress in AI-assisted security research, with broader implications for AI safety.
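To make the four-step decomposition concrete, here is a minimal sketch in plain Python. It is not the paper's agent and not a real defense: the toy logistic-regression model, its weights, the input, and all function names are assumptions chosen for illustration. The gradient is written analytically, so the "differentiability" step here is simply the input-gradient function.

```python
import math

def forward(w, b, x):
    """Step 1 (forward pass): toy logistic-regression model, returns P(y=1 | x)."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-z))

def loss_grad(w, b, x, y):
    """Step 2 (differentiability): gradient of the cross-entropy loss w.r.t. x."""
    p = forward(w, b, x)
    return [(p - y) * wi for wi in w]

def sign(v):
    """Return -1, 0, or 1 according to the sign of v."""
    return (v > 0) - (v < 0)

def fgsm(w, b, x, y, eps):
    """Step 3 (FGSM): one signed-gradient step of size eps that increases the loss."""
    g = loss_grad(w, b, x, y)
    return [xi + eps * sign(gi) for xi, gi in zip(x, g)]

def pgd(w, b, x, y, eps, step, iters):
    """Step 4 (PGD): repeated signed-gradient steps, each projected back
    onto the L-infinity ball of radius eps around the original input."""
    x_adv = list(x)
    for _ in range(iters):
        g = loss_grad(w, b, x_adv, y)
        x_adv = [xa + step * sign(gi) for xa, gi in zip(x_adv, g)]
        x_adv = [min(max(xa, xi - eps), xi + eps) for xa, xi in zip(x_adv, x)]
    return x_adv

# Hypothetical toy setup: all values below are illustrative assumptions.
w, b = [2.0, -1.0], 0.0
x, y = [0.5, 0.2], 1          # clean input with true label 1
eps = 0.5                     # L-infinity perturbation budget

x_fgsm = fgsm(w, b, x, y, eps)
x_pgd = pgd(w, b, x, y, eps, step=0.1, iters=10)
```

In this toy setting each step is trivial. On real defense code, any one of them (getting a working forward pass, or obtaining usable gradients at all) can be the hard part, which fits the gap between CTF-like and real-world results reported above.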
Abstract
We introduce AutoAdvExBench, a benchmark to evaluate whether large language models (LLMs) can autonomously exploit defenses against adversarial examples. Unlike existing security benchmarks that often serve as proxies for real-world tasks, AutoAdvExBench directly measures LLMs' success on tasks regularly performed by machine learning security experts. This approach offers a significant advantage: if an LLM could solve the challenges presented in AutoAdvExBench, it would immediately present practical utility for adversarial machine learning researchers. We then design a strong agent that is capable of breaking 75% of CTF-like ("homework exercise") adversarial example defenses. However, we show that this agent succeeds on only 13% of the real-world defenses in our benchmark, indicating a large gap in difficulty between attacking "real" code and CTF-like code. In contrast, a stronger LLM that can attack 21% of real defenses succeeds on only 54% of CTF-like defenses. We make this benchmark available at https://github.com/ethz-spylab/AutoAdvExBench.
