CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale

Zhun Wang, Tianneng Shi, Jingxuan He, Matthew Cai, Jialin Zhang, Dawn Song

TL;DR

CyberGym proposes a large-scale, realistic benchmark to evaluate AI agents’ cybersecurity capabilities by requiring PoCs that reproduce 1,507 real-world vulnerabilities across 188 OSS projects, with execution-based verification on pre- and post-patch versions. It introduces a ladder of difficulty, including open-ended discovery and provision of patch data, to reflect real-world vulnerability lifecycles and enable progression tracking. The study shows current frontier agents and LLMs struggle, with top results around 22.0% success, underscoring the benchmark’s challenge and its value for measuring progress in cybersecurity AI. Beyond benchmarking, CyberGym demonstrates direct security impact through incomplete-patch and zero-day discoveries (35 zero-days, 17 incomplete patches; 3 CVEs assigned; 6 patches), and enables large-scale vulnerability discovery across 431 projects, highlighting practical security benefits of AI-assisted analysis.

Abstract

AI agents have significant potential to reshape cybersecurity, making a thorough assessment of their capabilities critical. However, existing evaluations fall short, because they are based on small-scale benchmarks and only measure static outcomes, failing to capture the full, dynamic range of real-world security challenges. To address these limitations, we introduce CyberGym, a large-scale benchmark featuring 1,507 real-world vulnerabilities across 188 software projects. Adjustable to different vulnerability analysis settings, CyberGym primarily tasks agents with generating a proof-of-concept test that reproduces a vulnerability, given only its text description and the corresponding codebase. Our extensive evaluation highlights that CyberGym effectively differentiates agents' and models' cybersecurity capabilities. Even the top-performing combinations only achieve a ~20% success rate, demonstrating the overall difficulty of CyberGym. Beyond static benchmarking, we show that CyberGym leads to the discovery of 35 zero-day vulnerabilities and 17 historically incomplete patches. These results underscore that CyberGym is not only a robust benchmark for measuring AI's progress in cybersecurity but also a platform for creating direct, real-world security impact.
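
The execution-based check mentioned above (running a PoC against the pre-patch and post-patch builds) can be illustrated with a minimal sketch. It assumes that a PoC counts as a successful reproduction when it crashes the pre-patch build and does not crash the post-patch build. The names `PocRun`, `reproduces_vulnerability`, and `success_rate` are hypothetical and are not taken from the CyberGym codebase.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class PocRun:
    """Outcome of running one PoC on both builds (hypothetical record type)."""
    crashed_pre_patch: bool
    crashed_post_patch: bool


def reproduces_vulnerability(run: PocRun) -> bool:
    # Assumed criterion: the PoC triggers a crash before the patch
    # and no longer triggers one after the patch is applied.
    return run.crashed_pre_patch and not run.crashed_post_patch


def success_rate(runs: Iterable[PocRun]) -> float:
    # Fraction of runs that satisfy the assumed criterion; 0.0 for no runs.
    runs = list(runs)
    if not runs:
        return 0.0
    return sum(reproduces_vulnerability(r) for r in runs) / len(runs)
```

Under this assumed criterion, a PoC that crashes both builds does not count as a reproduction of the targeted vulnerability, although such a result may still be worth inspecting by hand.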

Paper Structure

This paper contains 52 sections, 18 figures, 6 tables.

Figures (18)

  • Figure 1: CyberGym includes 1,507 instances from real-world vulnerabilities across 188 diverse projects. For benchmarking, AI agents receive vulnerability descriptions and pre-patch codebases to generate proof-of-concept (PoC) tests for vulnerability reproduction. Going a step further, CyberGym creates direct security impact via detecting incomplete patches and zero-day vulnerabilities.
  • Figure 2: OSS-Fuzz lifecycle.
  • Figure 3: Results of various LLMs with OpenHands.
  • Figure 4: With and without thinking.
  • Figure 5: Success rates of different agent frameworks using GPT-4.1.
  • ...and 13 more figures