CVE-Bench: A Benchmark for AI Agents' Ability to Exploit Real-World Web Application Vulnerabilities

Yuxuan Zhu, Antony Kellermann, Dylan Bowman, Philip Li, Akul Gupta, Adarsh Danda, Richard Fang, Conner Jensen, Eric Ihli, Jason Benn, Jet Geronimo, Avi Dhir, Sudhit Rao, Kaicheng Yu, Twm Stone, Daniel Kang

TL;DR

CVE-Bench is a real-world web vulnerability benchmark for evaluating the ability of LLM agents to exploit critical-severity CVEs. It combines a sandboxed, containerized framework with eight standardized attack types, automatic evaluation, and reproduced exploits for 40 CVEs from the NVD, and it covers both zero-day and one-day stages of the vulnerability lifecycle. Experimental results show modest exploitation rates (up to 13%), highlighting both the advancing capabilities of AI agents and the need for comprehensive red-teaming and governance. The benchmark provides a foundation for systematic, reproducible assessment of AI-driven cybersecurity capabilities and risks in real-world web environments.

Abstract

Large language model (LLM) agents are increasingly capable of autonomously conducting cyberattacks, posing significant threats to existing applications. This growing risk highlights the urgent need for a real-world benchmark to evaluate the ability of LLM agents to exploit web application vulnerabilities. However, existing benchmarks fall short as they are limited to abstracted Capture the Flag competitions or lack comprehensive coverage. Building a benchmark for real-world vulnerabilities involves both specialized expertise to reproduce exploits and a systematic approach to evaluating unpredictable threats. To address this challenge, we introduce CVE-Bench, a real-world cybersecurity benchmark based on critical-severity Common Vulnerabilities and Exposures. In CVE-Bench, we design a sandbox framework that enables LLM agents to exploit vulnerable web applications in scenarios that mimic real-world conditions, while also providing effective evaluation of their exploits. Our evaluation shows that the state-of-the-art agent framework can resolve up to 13% of vulnerabilities.

Paper Structure

This paper contains 22 sections, 5 figures, and 6 tables.

Figures (5)

  • Figure 1: Illustration of the sandbox framework in CVE-Bench as applied to a WordPress web application. It features environment isolation and supports various stages of the vulnerability lifecycle (e.g., zero-day and one-day), diverse attacks, and automatic evaluation.
  • Figure 2: Distribution of attack types in our exploit reproduction of all vulnerabilities in CVE-Bench. We consider all types of attacks when evaluating LLM agents.
  • Figure 3: Success rates of different LLM agents on CVE-Bench. LLM agents can exploit up to 10% and 13% of vulnerabilities under the zero-day and one-day settings, respectively.
  • Figure 4: Distribution of successful exploits by agents. Only the attack types that were conducted successfully are shown.
  • Figure 5: Running ZAP on CVE-2023-37999 with all options enabled. ZAP identified 19 low-to-medium risks, but none of them is related to the critical vulnerability reported in CVE-2023-37999.