BountyBench: Dollar Impact of AI Agent Attackers and Defenders on Real-World Cybersecurity Systems

Andy K. Zhang, Joey Ji, Celeste Menders, Riya Dulepet, Thomas Qin, Ron Y. Wang, Junrong Wu, Kyleen Liao, Jiliang Li, Jinghan Hu, Sara Hong, Nardos Demilew, Shivatmica Murgai, Jason Tran, Nishka Kacheria, Ethan Ho, Denis Liu, Lauren McLane, Olivia Bruvik, Dai-Rong Han, Seungwoo Kim, Akhil Vyas, Cuiyuanxiu Chen, Ryan Li, Weiran Xu, Jonathan Z. Ye, Prerit Choudhary, Siddharth M. Bhatia, Vikram Sivashankar, Yuxuan Bao, Dawn Song, Dan Boneh, Daniel E. Ho, Percy Liang

TL;DR

The paper defines a framework to quantify offensive and defensive cyber capabilities in evolving real-world systems and instantiates it as BountyBench, a benchmark with 25 systems and 40 bug bounties spanning 9 OWASP risks. It formalizes a vulnerability lifecycle with Detect, Exploit, and Patch tasks, introduces a Detect Indicator for localized evaluation, and uses information as a difficulty modulator to differentiate agent performance. Through evaluations of 10 agents, it reveals offense-defense dynamics, shows that information-rich settings improve discrimination among agents, and maps task outcomes to real monetary rewards. The work provides a pathway to robustly assess AI-driven cybersecurity capabilities and informs defense strategies, policy considerations, and future benchmark scaling.

Abstract

AI agents have the potential to significantly alter the cybersecurity landscape. Here, we introduce the first framework to capture offensive and defensive cyber-capabilities in evolving real-world systems. Instantiating this framework with BountyBench, we set up 25 systems with complex, real-world codebases. To capture the vulnerability lifecycle, we define three task types: Detect (detecting a new vulnerability), Exploit (exploiting a given vulnerability), and Patch (patching a given vulnerability). For Detect, we construct a new success indicator, which is general across vulnerability types and provides localized evaluation. We manually set up the environment for each system, including installing packages, setting up server(s), and hydrating database(s). We add 40 bug bounties, which are vulnerabilities with monetary awards from $10 to $30,485, covering 9 of the OWASP Top 10 Risks. To modulate task difficulty, we devise a new strategy based on information to guide detection, interpolating from identifying a zero day to exploiting a given vulnerability. We evaluate 10 agents: Claude Code, OpenAI Codex CLI with o3-high and o4-mini, and custom agents with o3-high, GPT-4.1, Gemini 2.5 Pro Preview, Claude 3.7 Sonnet Thinking, Qwen3 235B A22B, Llama 4 Maverick, and DeepSeek-R1. Given up to three attempts, the top-performing agents are Codex CLI: o3-high (12.5% on Detect, mapping to $3,720; 90% on Patch, mapping to $14,152), Custom Agent: Claude 3.7 Sonnet Thinking (67.5% on Exploit), and Codex CLI: o4-mini (90% on Patch, mapping to $14,422). Codex CLI: o3-high, Codex CLI: o4-mini, and Claude Code are more capable at defense, achieving higher Patch scores of 90%, 90%, and 87.5%, compared to Exploit scores of 47.5%, 32.5%, and 57.5% respectively; while the custom agents are relatively balanced between offense and defense, achieving Exploit scores of 17.5-67.5% and Patch scores of 25-60%.
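
The success rates in the abstract are easier to compare as task counts. The sketch below converts a few of the reported percentages into counts. It assumes that each task type is scored over 40 tasks, one per bug bounty; the abstract does not state this explicitly, although every reported rate is a multiple of 2.5%, which is consistent with 40 tasks. The helper name `successes` is illustrative and not part of BountyBench.

```python
from fractions import Fraction

# Assumption (not stated in the abstract): each task type is scored over
# 40 tasks, one per bug bounty.
NUM_TASKS = 40


def successes(rate_percent: str, num_tasks: int = NUM_TASKS) -> int:
    """Convert a reported success rate, given as a percent string, to a task count."""
    count = Fraction(rate_percent) / 100 * num_tasks
    if count.denominator != 1:
        raise ValueError(
            f"{rate_percent}% of {num_tasks} is not a whole number of tasks"
        )
    return int(count)


# Rates copied from the abstract (agents were given up to three attempts).
reported = {
    ("Codex CLI: o3-high", "Detect"): "12.5",
    ("Codex CLI: o3-high", "Patch"): "90",
    ("Codex CLI: o4-mini", "Patch"): "90",
    ("Claude Code", "Patch"): "87.5",
    ("Custom Agent: Claude 3.7 Sonnet Thinking", "Exploit"): "67.5",
}

for (agent, task), rate in reported.items():
    print(f"{agent:42s} {task:8s} {rate:>5s}% -> {successes(rate)}/{NUM_TASKS}")
```

Under that assumption, 12.5% on Detect corresponds to 5 of 40 tasks, and 90% on Patch corresponds to 36 of 40.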

Paper Structure

This paper contains 68 sections, 4 equations, 24 figures, and 71 tables.

Figures (24)

  • Figure 1: BountyBench consists of Detect, Exploit, and Patch tasks, which each pass a distinct task input to the agent. The agent takes an action in a Kali Linux container containing the codebase, which can connect to any server(s) and/or database(s) via the network. Execution of the command yields an observation, which the agent leverages to take additional actions in an action-observation loop until the agent submits the task output to the evaluator, which then scores the submission on various metrics including success/failure, dollar value, and usage metrics.
  • Figure 2: Each system consists of a series of snapshots, each associated with runtimes, invariants, and vulnerabilities. Each vulnerability is associated with exploits, verifiers, and patches.
  • Figure 3: Flow diagrams for each of the 3 task types: Detect, Exploit, and Patch.
  • Figure 4: On the Detect task with increasing levels of information, we see improvement in agent performance as information increases from detection to exploitation, demonstrating that information is an effective modulator of task difficulty.
  • Figure 5: Tasks sorted by disclosure bounty value (log scale). Disclosure bounties range from $40 to $30,485, and patch bounties from $10 to $7,621.25, with patch bounties averaging approximately 25% of their corresponding disclosure amounts. fastapi 0 was assigned a disclosure bounty that was not publicly disclosed. parse-url 0 was assigned a fix bounty that was not publicly disclosed.
  • ...and 19 more figures
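
The hierarchy described in the Figure 2 caption (systems contain snapshots; snapshots carry runtimes, invariants, and vulnerabilities; vulnerabilities carry exploits, verifiers, and patches) can be sketched as a small data model. This is a minimal illustration only: the class names, field names, and the use of strings for every artifact are assumptions made here, not the benchmark's actual schema.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Vulnerability:
    """One vulnerability with its associated artifacts (field types are assumed)."""
    name: str
    exploits: List[str] = field(default_factory=list)
    verifiers: List[str] = field(default_factory=list)
    patches: List[str] = field(default_factory=list)


@dataclass
class Snapshot:
    """One state of a system's codebase, as described in the Figure 2 caption."""
    identifier: str  # hypothetical label for the snapshot
    runtimes: List[str] = field(default_factory=list)
    invariants: List[str] = field(default_factory=list)
    vulnerabilities: List[Vulnerability] = field(default_factory=list)


@dataclass
class System:
    """A system is a series of snapshots."""
    name: str
    snapshots: List[Snapshot] = field(default_factory=list)

    def all_vulnerabilities(self) -> List[Vulnerability]:
        """Collect the vulnerabilities from every snapshot, in snapshot order."""
        return [v for s in self.snapshots for v in s.vulnerabilities]


# Placeholder data; none of these names come from BountyBench.
example = System(
    name="example-system",
    snapshots=[
        Snapshot(
            identifier="snapshot-0",
            runtimes=["server"],
            invariants=["unit tests pass"],
            vulnerabilities=[
                Vulnerability(
                    name="vuln-0",
                    exploits=["exploit.sh"],
                    verifiers=["verify.sh"],
                    patches=["fix.patch"],
                )
            ],
        )
    ],
)
print(len(example.all_vulnerabilities()))
```

Keeping vulnerabilities under snapshots rather than directly under the system mirrors the caption's wording that each snapshot is associated with its own vulnerabilities.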