Cybersecurity AI: The World's Top AI Agent for Security Capture-the-Flag (CTF)

Víctor Mayoral-Vilches, Luis Javier Navarrete-Lozano, Francesco Balassone, María Sanz-Gómez, Cristóbal R. J. Veas Chavez, Maite del Mundo de Torres, Vanesa Turiel

TL;DR

The paper investigates the obsolescence of Jeopardy-style CTFs in the era of autonomous AI security agents and presents CAI, an alias1-based framework with entropy-driven multi-model orchestration, as a scalable, cost-efficient solution. Across five major CTF circuits, CAI achieves top ranks, high solve rates (up to 91%), and substantial velocity advantages over human teams, while dramatically cutting operational costs (≈98% reduction) to enable continuous defense operations. The authors argue for transitioning to Attack & Defense formats to better test adaptive reasoning and resilience in real-world contexts, particularly for OT security, and discuss ethical, benchmarking, and governance considerations. Collectively, the work demonstrates that AI-enabled security can operate at enterprise scale and challenges the research and industry communities to redefine evaluation standards and deployment strategies for autonomous cybersecurity.

Abstract

Are Capture-the-Flag competitions obsolete? In 2025, Cybersecurity AI (CAI) systematically conquered some of the world's most prestigious hacking competitions, achieving Rank #1 at multiple events and consistently outperforming thousands of human teams. Across five major circuits (HTB's AI vs Humans; Cyber Apocalypse, with 8,129 teams; Dragos OT CTF; UWSP Pointer Overflow; and the Neurogrid CTF showdown), CAI demonstrated that Jeopardy-style CTFs have become a solved game for well-engineered AI agents. At Neurogrid, CAI captured 41/45 flags to claim the $50,000 top prize; at Dragos OT, it sprinted 37% faster to 10K points than elite human teams; even when deliberately paused mid-competition, it maintained top-tier rankings. Critically, CAI achieved this dominance through our specialized alias1 model architecture, which delivers enterprise-scale AI security operations at unprecedented cost efficiency and with augmented autonomy, reducing 1B-token inference costs from $5,940 to just $119 and making continuous security agent operation financially viable for the first time. These results force an uncomfortable reckoning: if autonomous agents now dominate competitions designed to identify top security talent at negligible cost, what are CTFs actually measuring? This paper presents comprehensive evidence of AI capability across the 2025 CTF circuit and argues that the security community must urgently transition from Jeopardy-style contests to Attack & Defense formats that genuinely test adaptive reasoning and resilience, capabilities that remain uniquely human, for now.
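The headline percentages can be checked against the raw figures quoted in the abstract. The sketch below is illustrative arithmetic only: the function and variable names are assumptions, not part of CAI. Linking the TL;DR's "up to 91%" solve rate to the 41/45 Neurogrid result is an inference from the numbers, not something the source states.

```python
# Figures quoted in the abstract; all names here are illustrative assumptions.
BASELINE_COST_USD = 5_940  # quoted cost per 1B tokens before the reduction
ALIAS1_COST_USD = 119      # quoted cost per 1B tokens with alias1


def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


def solve_rate(captured: int, total: int) -> float:
    """Share of flags captured, in percent."""
    return captured / total * 100.0


cost_cut = percent_reduction(BASELINE_COST_USD, ALIAS1_COST_USD)
neurogrid_rate = solve_rate(41, 45)  # 41 of 45 flags at Neurogrid

print(f"Cost reduction: {cost_cut:.1f}%")              # Cost reduction: 98.0%
print(f"Neurogrid solve rate: {neurogrid_rate:.1f}%")  # Neurogrid solve rate: 91.1%
```

Both results agree with the rounded values in the TL;DR (≈98% cost reduction, up to 91% solve rate).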

Paper Structure

This paper contains 20 sections, 4 equations, 10 figures, 4 tables.

Figures (10)

  • Figure 1: CAI's performance percentile across the 2025 CTF circuit, showing percentage of teams outperformed. Performance zones: Elite 1% (crosshatch), Top 5% (diagonal), and Top 10% (dots). All peak performances reached top 5% tier, with final rankings in top 15%.
  • Figure 2: Performance comparison of AI teams in HTB "AI vs Human" CTF. CAI (top) achieved its final flag 30 minutes before the next AI team, demonstrating superior velocity despite equal point totals. The time advantage proved decisive for the #1 AI ranking and $750 prize.
  • Figure 3: CAI's performance improvement between consecutive HTB competitions. In the initial AI vs Human CTF, CAI captured 19 flags/challenges; in Cyber Apocalypse CTF 2025, it reached 30 flags and 20 challenges in the same 3-hour window, illustrating rapid capability evolution in autonomous Jeopardy-style CTF solving.
  • Figure 4: Top-10 trajectories across the 48-hour Dragos OT CTF 2025. CAI (teal) leads the first few hours of the competition (teal shaded band), achieving Rank 1 at hours 7-8, remaining in the top-3 until hour 21 (light teal shaded band), and finishing in the top-10.
  • Figure 5: UWSP Pointer Overflow 2025: Complete 54-day competition timeline. The top three teams competed for 31-54 days to reach 16,000 points. CAI entered on day 51 (November 4) when leaders had already accumulated 15,000+ points, yet achieved 11,500 points in just 60 hours—demonstrating a solve velocity that would have matched top teams if given equal time.
  • ...and 5 more figures
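
The velocity claim in the Figure 5 caption can be made concrete by comparing average scoring rates. This is a rough sketch using only the numbers in the caption; the names are hypothetical. It compares averages over each team's active period and makes no assumption that CAI's rate would hold over a longer window, since the remaining challenges may be harder.

```python
# Numbers taken from the Figure 5 caption; names are illustrative assumptions.
HOURS_PER_DAY = 24

cai_points = 11_500
cai_hours = 60

leader_points = 16_000
leader_days_fastest = 31  # shortest active period among the top three teams


def points_per_hour(points: float, hours: float) -> float:
    """Average scoring rate over the active period."""
    return points / hours


cai_rate = points_per_hour(cai_points, cai_hours)
leader_rate = points_per_hour(leader_points, leader_days_fastest * HOURS_PER_DAY)
ratio = cai_rate / leader_rate

print(f"CAI: {cai_rate:.1f} points/hour")
print(f"Fastest leader: {leader_rate:.1f} points/hour")
print(f"Ratio: {ratio:.1f}x")
```

Under these averages, CAI's rate is roughly nine times that of the fastest top-three team, which is the gap the caption's "solve velocity" remark points to.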