SoK: DARPA's AI Cyber Challenge (AIxCC): Competition Design, Architectures, and Lessons Learned

Cen Zhang; Younggi Park; Fabian Fleischer; Yu-Fu Fu; Jiho Kim; Dongkwan Kim; Youngjoon Kim; Qingxiao Xu; Andrew Chin; Ze Sheng; Hanqing Zhao; Brian J. Lee; Joshua Wang; Michael Pelican; David J. Musliner; Jeff Huang; Jon Silliman; Mikel Mcdaniel; Jefferson Casavant; Isaac Goldthwaite; Nicholas Vidovich; Matthew Lehman; Taesoo Kim

TL;DR

This paper systematically analyzes DARPA's AI Cyber Challenge (AIxCC), the largest autonomous vulnerability-discovery competition to date, examining its design, the architectures of seven finalist CRSs, and the competition results across 53 challenge projects derived from 24 OSS repositories. It finds that sustained, robust integration and real-world operating conditions drive performance more than any single technique: foundational methods solve many CPVs but face reliability bottlenecks in large-scale autonomous operation. The study introduces a taxonomy of CRS techniques (PoV generation, patch generation, SARIF validation, bundling), analyzes per-CPV performance against baseline techniques, and extracts lessons for competition design, deployment in OSS, and future research directions, emphasizing telemetry, open-source model exploration, and resource-constrained deployment. The findings offer actionable guidance for organizers, researchers, and practitioners aiming to advance autonomous CRSs from competition settings to practical security tooling, with implications for multi-CRS coordination, semantic correctness evaluation, and deployment pathways.

Abstract

DARPA's AI Cyber Challenge (AIxCC, 2023–2025) is the largest competition to date for building fully autonomous cyber reasoning systems (CRSs) that leverage recent advances in AI, particularly large language models (LLMs), to discover and remediate vulnerabilities in real-world open-source software. This paper presents the first systematic analysis of AIxCC. Drawing on design documents, source code, execution traces, and discussions with organizers and competing teams, we examine the competition's structure and key design decisions, characterize the architectural approaches of finalist CRSs, and analyze competition results beyond the final scoreboard. Our analysis reveals the factors that truly drove CRS performance, identifies genuine technical advances achieved by teams, and exposes limitations that remain open for future research. We conclude with lessons for organizing future competitions and broader insights toward deploying autonomous CRSs in practice.

Paper Structure (34 sections, 5 equations, 8 figures, 14 tables)

Introduction
Background: AIxCC as Competition
Competition Design
Design Goal: Real-World Relevance
Iterative Design
Challenge Projects
Cyber Reasoning Systems
Taxonomy of CRS Techniques
PoV Generation
Patch Generation
SARIF Validation
Bundling Strategy
Competition Result Analysis
What Scores Reveal (and Conceal)
Auxiliary CPV Annotation
...and 19 more sections

Figures (8)

Figure 1: AFC workflow. GitHub webhooks trigger challenge dispatch and CRSs submit results via the Competition API. Each CRS operates in an isolated network with access to the Competition API, build dependencies, and LLM endpoints.
Figure 2: Score per time (top) and phase (bottom) axes.
Figure 3: Team performance per CPV (CWE-wise breakdowns are in the CWE analysis appendix). Matrices i)–iv) indicate successfully detected and patched CPVs and 0-days; matrix v) shows SARIF assessment results. We mark CPVs for which a CRS did not send any log messages (diagonal line). We annotate each CPV that can be found by an off-the-shelf parallel fuzzer (PF) or patched by Claude Code (CC) or a multi-retrieval agent (MR), both using Claude 3.7 Sonnet under ideal laboratory conditions. For invalid SARIF broadcasts (-), the expected assessment is Incorrect; ✓ and ✗ indicate the CRS assessed it as Incorrect and Correct, respectively.
Figure 4: Token consumption (input + output) per model by team.
Figure 5: I/O token ratio per model by team.
...and 3 more figures
