SoK: DARPA's AI Cyber Challenge (AIxCC): Competition Design, Architectures, and Lessons Learned
Cen Zhang, Younggi Park, Fabian Fleischer, Yu-Fu Fu, Jiho Kim, Dongkwan Kim, Youngjoon Kim, Qingxiao Xu, Andrew Chin, Ze Sheng, Hanqing Zhao, Brian J. Lee, Joshua Wang, Michael Pelican, David J. Musliner, Jeff Huang, Jon Silliman, Mikel Mcdaniel, Jefferson Casavant, Isaac Goldthwaite, Nicholas Vidovich, Matthew Lehman, Taesoo Kim
TL;DR
This paper systematically analyzes DARPA's AI Cyber Challenge (AIxCC), the largest autonomous vulnerability-discovery competition to date. It examines the competition's design, the architectures of seven finalist cyber reasoning systems (CRSs), and the results across 53 challenge projects derived from 24 open-source software (OSS) repositories. The analysis finds that sustained, robust integration and real-world operating conditions drive performance more than any single technique: foundational methods solve many CPVs but hit reliability bottlenecks in large-scale autonomous operation. The study introduces a taxonomy of CRS techniques (PoV generation, patch generation, SARIF validation, bundling), analyzes per-CPV performance against baseline techniques, and extracts lessons for competition design, deployment in OSS, and future research directions, emphasizing telemetry, open-source model exploration, and resource-constrained deployment. The findings offer actionable guidance for organizers, researchers, and practitioners working to move autonomous CRSs from competition settings to practical security tooling, with implications for multi-CRS coordination, semantic correctness evaluation, and deployment pathways.
Abstract
DARPA's AI Cyber Challenge (AIxCC, 2023-2025) is the largest competition to date for building fully autonomous cyber reasoning systems (CRSs) that leverage recent advances in AI, particularly large language models (LLMs), to discover and remediate vulnerabilities in real-world open-source software. This paper presents the first systematic analysis of AIxCC. Drawing on design documents, source code, execution traces, and discussions with organizers and competing teams, we examine the competition's structure and key design decisions, characterize the architectural approaches of finalist CRSs, and analyze competition results beyond the final scoreboard. Our analysis reveals the factors that truly drove CRS performance, identifies genuine technical advances achieved by teams, and exposes limitations that remain open for future research. We conclude with lessons for organizing future competitions and broader insights toward deploying autonomous CRSs in practice.
