AutoAdvExBench: Benchmarking autonomous exploitation of adversarial example defenses

Nicholas Carlini, Javier Rando, Edoardo Debenedetti, Milad Nasr, Florian Tramèr

TL;DR

AutoAdvExBench is a proxy-free benchmark that directly measures an LLM's ability to autonomously break adversarial-example defenses: the model is given a defense's paper and implementation and must produce an end-to-end adversarial attack. The study demonstrates a substantial gap between performance on CTF-like defenses ($75\%$ attack success on a $24$-defense subset) and real-world defenses ($21\%$ on $51$ defenses with the strongest model), underscoring the importance of evaluating security capabilities on real-world, end-to-end tasks. By designing a specialized, stepwise agent, the paper shows that decomposing the attack process into forward-pass, differentiability, FGSM, and PGD steps markedly improves success rates, yet real-world code remains far more challenging. These results highlight the need for continuously updated, proxy-free benchmarks that closely mirror practical security tasks and offer a scalable, interpretable measure of progress in AI-assisted security research, with broader implications for AI safety.
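The four-step decomposition named above (forward pass, differentiability, FGSM, PGD) can be illustrated on a toy model. The following is a minimal sketch, assuming a hand-written logistic-regression "defense" with an analytic input gradient; it is not the benchmark's agent or any defense from the paper, and all function names, weights, and inputs are hypothetical values chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(w, b, x):
    """Step 1 (forward pass): probability of class 1 under a toy logistic model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)

def input_gradient(w, b, x, y):
    """Step 2 (differentiability): gradient of the cross-entropy loss w.r.t. x.

    For logistic regression this has the closed form (p - y) * w.
    """
    p = predict_proba(w, b, x)
    return [(p - y) * wi for wi in w]

def sign(v):
    return (v > 0) - (v < 0)

def clip(v, lo, hi):
    return max(lo, min(hi, v))

def fgsm(w, b, x, y, eps):
    """Step 3 (FGSM): one signed-gradient step of size eps, kept inside [0, 1]."""
    g = input_gradient(w, b, x, y)
    return [clip(xi + eps * sign(gi), 0.0, 1.0) for xi, gi in zip(x, g)]

def pgd(w, b, x, y, eps, alpha, steps):
    """Step 4 (PGD): repeated signed-gradient steps of size alpha, each followed
    by projection onto the L-infinity ball of radius eps around x and onto [0, 1]."""
    x_adv = list(x)
    for _ in range(steps):
        g = input_gradient(w, b, x_adv, y)
        x_adv = [
            clip(clip(xa + alpha * sign(gi), xi - eps, xi + eps), 0.0, 1.0)
            for xa, xi, gi in zip(x_adv, x, g)
        ]
    return x_adv

# Hypothetical toy model and input (not taken from the paper).
W = [2.0, -3.0, 1.0]
B = 0.0
X = [0.6, 0.3, 0.5]
Y = 1  # true label

x_fgsm = fgsm(W, B, X, Y, eps=0.2)
x_pgd = pgd(W, B, X, Y, eps=0.2, alpha=0.05, steps=10)
```

On this toy input the clean example is classified as class 1, and both perturbed inputs fall below the 0.5 decision threshold while staying within the chosen budget of 0.2 per coordinate. Real defenses are harder mainly because the first two steps (getting a working forward pass and a usable gradient out of research code) are nontrivial, which the toy model sidesteps entirely.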

Abstract

We introduce AutoAdvExBench, a benchmark to evaluate whether large language models (LLMs) can autonomously exploit defenses to adversarial examples. Unlike existing security benchmarks that often serve as proxies for real-world tasks, AutoAdvExBench directly measures LLMs' success on tasks regularly performed by machine learning security experts. This approach offers a significant advantage: if an LLM could solve the challenges presented in AutoAdvExBench, it would immediately be of practical use to adversarial machine learning researchers. We then design a strong agent that is capable of breaking 75% of CTF-like ("homework exercise") adversarial example defenses. However, we show that this agent succeeds on only 13% of the real-world defenses in our benchmark, indicating the large gap in difficulty between attacking "real" code and CTF-like code. In contrast, a stronger LLM that can attack 21% of real defenses succeeds on only 54% of CTF-like defenses. We make this benchmark available at https://github.com/ethz-spylab/AutoAdvExBench.

Paper Structure

This paper contains 38 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: We curate a dataset of 51 real-world defense implementations. We do this by crawling arXiv papers, filtering to just those on adversarial machine learning using a simple Naive Bayes classifier, further filtering these to a set of potential defenses to adversarial examples by few-shot prompting GPT-4o, manually filtering these to defenses with public implementations, and manually filtering once more to 40 reproducible GitHub repositories. Because some papers describe multiple defenses, and some papers are implemented multiple times, the count increases slightly to 51 total defense implementations of 46 unique papers.
  • Figure 2: Our strongest agent is able to attack (left) 75% (18 out of 24) of defenses in the subset of CTF-like ("homework exercise") adversarial example defenses, compared to (right) 17% (9 out of 51) of defenses from real-world code implementations for the same model. The more recent Claude 3.7 Sonnet, in contrast, succeeds much more often on the real-world dataset (attacking 11 out of the 51 defenses) but far less often on the CTF-like defenses (just 13 out of 24). While LLMs do have the tools available to break adversarial example defenses, they cannot yet reliably do so given the complications of real-world code.
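
The percentages quoted in the Figure 2 caption can be re-derived from its raw counts. The sketch below is only an arithmetic check; the dictionary labels and function names are hypothetical. The caption's figures appear consistent with truncating to a whole percent rather than rounding to the nearest one (9/51 is about 17.6% and 11/51 is about 21.6%).

```python
# Counts copied from the Figure 2 caption: (defenses attacked, total defenses).
RESULTS = {
    "strongest agent, CTF-like": (18, 24),
    "strongest agent, real-world": (9, 51),
    "Claude 3.7 Sonnet, CTF-like": (13, 24),
    "Claude 3.7 Sonnet, real-world": (11, 51),
}

def success_rate(successes, total):
    """Exact attack success rate as a percentage."""
    return 100.0 * successes / total

def truncated_percent(successes, total):
    """Whole-number percentage obtained by truncation (integer division)."""
    return (100 * successes) // total

SUMMARY = {
    name: (success_rate(s, n), truncated_percent(s, n))
    for name, (s, n) in RESULTS.items()
}

for name, (exact, whole) in SUMMARY.items():
    print(f"{name}: {exact:.1f}% exact, {whole}% truncated")
```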