CyberGym: Evaluating AI Agents' Real-World Cybersecurity Capabilities at Scale
Zhun Wang, Tianneng Shi, Jingxuan He, Matthew Cai, Jialin Zhang, Dawn Song
TL;DR
CyberGym proposes a large-scale, realistic benchmark to evaluate AI agents’ cybersecurity capabilities by requiring PoCs that reproduce 1,507 real-world vulnerabilities across 188 OSS projects, with execution-based verification on pre- and post-patch versions. It introduces a ladder of difficulty, including open-ended discovery and provision of patch data, to reflect real-world vulnerability lifecycles and enable progression tracking. The study shows current frontier agents and LLMs struggle, with top results around 22.0% success, underscoring the benchmark’s challenge and its value for measuring progress in cybersecurity AI. Beyond benchmarking, CyberGym demonstrates direct security impact through incomplete-patch and zero-day discoveries (35 zero-days, 17 incomplete patches; 3 CVEs assigned; 6 patches), and enables large-scale vulnerability discovery across 431 projects, highlighting practical security benefits of AI-assisted analysis.
Abstract
AI agents have significant potential to reshape cybersecurity, making a thorough assessment of their capabilities critical. However, existing evaluations fall short, because they are based on small-scale benchmarks and only measure static outcomes, failing to capture the full, dynamic range of real-world security challenges. To address these limitations, we introduce CyberGym, a large-scale benchmark featuring 1,507 real-world vulnerabilities across 188 software projects. Adjustable to different vulnerability analysis settings, CyberGym primarily tasks agents with generating a proof-of-concept test that reproduces a vulnerability, given only its text description and the corresponding codebase. Our extensive evaluation highlights that CyberGym effectively differentiates agents' and models' cybersecurity capabilities. Even the top-performing combinations only achieve a ~20% success rate, demonstrating the overall difficulty of CyberGym. Beyond static benchmarking, we show that CyberGym leads to the discovery of 35 zero-day vulnerabilities and 17 historically incomplete patches. These results underscore that CyberGym is not only a robust benchmark for measuring AI's progress in cybersecurity but also a platform for creating direct, real-world security impact.
