Cybersecurity AI: The World's Top AI Agent for Security Capture-the-Flag (CTF)
Víctor Mayoral-Vilches, Luis Javier Navarrete-Lozano, Francesco Balassone, María Sanz-Gómez, Cristóbal R. J. Veas Chavez, Maite del Mundo de Torres, Vanesa Turiel
TL;DR
The paper investigates the obsolescence of Jeopardy-style CTFs in the era of autonomous AI security agents and presents CAI, an alias1-based framework with entropy-driven multi-model orchestration, as a scalable, cost-efficient solution. Across five major CTF circuits, CAI achieves top ranks, high solve rates (up to 91%), and substantial velocity advantages over human teams, while dramatically cutting operational costs (≈98% reduction) to enable continuous defense operations. The authors argue for transitioning to Attack & Defense formats to better test adaptive reasoning and resilience in real-world contexts, particularly for OT security, and discuss ethical, benchmarking, and governance considerations. Collectively, the work demonstrates that AI-enabled security can operate at enterprise scale and challenges the research and industry communities to redefine evaluation standards and deployment strategies for autonomous cybersecurity.
Abstract
Are Capture-the-Flag competitions obsolete? In 2025, Cybersecurity AI (CAI) systematically conquered some of the world's most prestigious hacking competitions, achieving Rank #1 at multiple events and consistently outperforming thousands of human teams. Across five major circuits, namely HTB's AI vs Humans, Cyber Apocalypse (8,129 teams), Dragos OT CTF, UWSP Pointer Overflow, and the Neurogrid CTF showdown, CAI demonstrated that Jeopardy-style CTFs have become a solved game for well-engineered AI agents. At Neurogrid, CAI captured 41/45 flags to claim the $50,000 top prize; at Dragos OT, it sprinted 37% faster to 10K points than elite human teams; even when deliberately paused mid-competition, it maintained top-tier rankings. Critically, CAI achieved this dominance through our specialized alias1 model architecture, which delivers enterprise-scale AI security operations at unprecedented cost efficiency and with augmented autonomy, reducing 1B token inference costs from $5,940 to just $119 and making continuous security agent operation financially viable for the first time. These results force an uncomfortable reckoning: if autonomous agents now dominate competitions designed to identify top security talent at negligible cost, what are CTFs actually measuring? This paper presents comprehensive evidence of AI capability across the 2025 CTF circuit and argues that the security community must urgently transition from Jeopardy-style contests to Attack & Defense formats that genuinely test adaptive reasoning and resilience, capabilities that remain uniquely human, for now.
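As a quick sanity check on the headline figures, the short sketch below recomputes the cost reduction and the Neurogrid solve rate from the numbers quoted in the abstract. The function and variable names are illustrative assumptions and do not come from the paper or from CAI itself; only the four input values ($5,940, $119, 41, 45) are taken from the text.

```python
# Illustrative arithmetic only: helper names are hypothetical, inputs come from the abstract.

def percent_reduction(before: float, after: float) -> float:
    """Relative reduction from `before` to `after`, in percent."""
    return (before - after) / before * 100.0


def solve_rate(captured: int, total: int) -> float:
    """Share of flags captured, in percent."""
    return captured / total * 100.0


# Inference cost per 1B tokens, as quoted in the abstract (USD).
baseline_cost_usd = 5940.0
alias1_cost_usd = 119.0

# Neurogrid CTF result, as quoted in the abstract.
flags_captured = 41
flags_total = 45

reduction = percent_reduction(baseline_cost_usd, alias1_cost_usd)
rate = solve_rate(flags_captured, flags_total)

print(f"Cost reduction: {reduction:.1f}%")       # about 98%
print(f"Neurogrid solve rate: {rate:.1f}%")      # about 91%
print(f"Cost per 1M tokens: ${alias1_cost_usd / 1000:.3f} vs ${baseline_cost_usd / 1000:.2f}")
```

Both results agree with the rounded values in the TL;DR (a reduction of roughly 98% and a solve rate of up to 91%).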
