From CVE Entries to Verifiable Exploits: An Automated Multi-Agent Framework for Reproducing CVEs

Saad Ullah, Praneeth Balasubramanian, Wenbo Guo, Amanda Burnett, Hammond Pearce, Christopher Kruegel, Giovanni Vigna, Gianluca Stringhini

TL;DR

CVE-Genie tackles the shortage of high-quality, reproducible CVE benchmarks by introducing an automated, multi-agent framework that reproduces real-world CVEs end-to-end. Grounded in the EAGER criteria, the system decomposes CVE reproduction into four modules (Processor, Builder, Exploiter, and CTF Verifier), each paired with a systematically selected LLM and task-specific prompts. On a large, post-cutoff set of CVEs published in 2024-2025, CVE-Genie reproduced 428 of 841 CVEs (about 51%) across 267 projects and 22 programming languages, at roughly $2.77 per CVE. The work provides open-science artifacts, including code, datasets, and interaction logs, so that defenders and researchers can build and assess tools against authentic, reproducible exploit environments. The resulting benchmarks are intended to support fuzzer evaluation, vulnerability detection, patch verification, and AI-assisted security research.

Abstract

High-quality datasets of real-world vulnerabilities and their corresponding verifiable exploits are crucial resources in software security research. Yet such resources remain scarce, as their creation demands intensive manual effort and deep security expertise. In this paper, we present CVE-GENIE, an automated, large language model (LLM)-based multi-agent framework designed to reproduce real-world vulnerabilities, provided in Common Vulnerabilities and Exposures (CVE) format, to enable the creation of high-quality vulnerability datasets. Given a CVE entry as input, CVE-GENIE gathers the relevant resources of the CVE, automatically reconstructs the vulnerable environment, and (re)produces a verifiable exploit. Our systematic evaluation highlights the efficiency and robustness of CVE-GENIE's design: it successfully reproduces approximately 51% (428 of 841) of the CVEs published in 2024-2025, complete with their verifiable exploits, at an average cost of $2.77 per CVE. Our pipeline offers a robust method to generate reproducible CVE benchmarks, valuable for diverse applications such as fuzzer evaluation, vulnerability patching, and assessing AI's security capabilities.
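The staged flow described above (gather resources, rebuild the environment, produce an exploit, verify it) can be pictured as a sequential pipeline that stops at the first failing stage. The following is only a minimal sketch under stated assumptions: the stage names come from the summary, but `ReproductionState`, `run_pipeline`, the field names, and the stub stage bodies are hypothetical placeholders, not the framework's actual code. In the real system each stage is an LLM-driven agent rather than a stub.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class ReproductionState:
    """Accumulates what each stage produces for one CVE (hypothetical schema)."""
    cve_id: str
    resources: Optional[dict] = None    # filled by the Processor stage
    environment: Optional[str] = None   # filled by the Builder stage
    exploit: Optional[str] = None       # filled by the Exploiter stage
    verified: bool = False              # set by the CTF Verifier stage
    log: List[Tuple[str, bool]] = field(default_factory=list)


# A stage mutates the state and reports whether it succeeded.
Stage = Callable[[ReproductionState], bool]


def run_pipeline(state: ReproductionState,
                 stages: List[Tuple[str, Stage]]) -> ReproductionState:
    """Run stages in order, recording each outcome and stopping on failure."""
    for name, stage in stages:
        ok = stage(state)
        state.log.append((name, ok))
        if not ok:
            break
    return state


# Stub stages: placeholders standing in for the LLM agents (assumption).
def processor(state: ReproductionState) -> bool:
    state.resources = {"description": "placeholder"}
    return True


def builder(state: ReproductionState) -> bool:
    if state.resources is None:
        return False
    state.environment = "placeholder-environment"
    return True


def exploiter(state: ReproductionState) -> bool:
    if state.environment is None:
        return False
    state.exploit = "placeholder-exploit"
    return True


def ctf_verifier(state: ReproductionState) -> bool:
    state.verified = state.exploit is not None
    return state.verified


STAGES: List[Tuple[str, Stage]] = [
    ("Processor", processor),
    ("Builder", builder),
    ("Exploiter", exploiter),
    ("CTF Verifier", ctf_verifier),
]

if __name__ == "__main__":
    result = run_pipeline(ReproductionState("CVE-2024-4340"), STAGES)
    print(result.verified, [name for name, _ in result.log])
```

Keeping the per-stage outcome in `log` mirrors the idea of retaining interaction logs: a failed run shows which stage stopped it, which is useful when only about half of the attempted CVEs reproduce.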

Paper Structure

This paper contains 43 sections, 10 figures, 19 tables.

Figures (10)

  • Figure 1: CVE-Genie Overview.
  • Figure 2: CVE-Genie architecture and an end-to-end example of the reproduction workflow for CVE-2024-4340, i.e., Denial of Service due to RecursionError in sqlparse < v0.5.0. The artifact of the complete reproduction run for CVE-2024-4340 is available at https://github.com/BUseclab/cve-genie/tree/main/results/CVE-2024-4340
  • Figure 3: PoC for CVE-2024-4340 in sqlparse v0.4.4, showing the exploit script and the resulting crash.
  • Figure 4: Verifier scripts for the exploit in Figure 3 for CVE-2024-4340, corresponding to the run in Figure 2, illustrating the progression from a weak to a robust verifier based on Verifier Critic feedback.
  • Figure 5: Builder Optimal LLM Evaluation
  • ...and 5 more figures