Comparing AI Agents to Cybersecurity Professionals in Real-World Penetration Testing

Justin W. Lin, Eliot Krzysztof Jones, Donovan Julian Jasper, Ethan Jun-shen Ho, Anna Wu, Arnold Tianyi Yang, Neil Perry, Andy Zou, Matt Fredrikson, J. Zico Kolter, Percy Liang, Dan Boneh, Daniel E. Ho

TL;DR

This work presents the first live, large-scale comparison of AI agents and cybersecurity professionals in a real enterprise environment and introduces ARTEMIS, a multi-agent scaffold with dynamic prompt generation and a triage module for long-horizon penetration testing. In a 16-hour evaluation on a university network (~8,000 hosts), ARTEMIS placed second, outperforming many existing AI scaffolds and offering notable cost savings versus human testers. The study highlights AI agents' strengths in systematic enumeration and parallel exploitation while noting weaknesses in false positives and GUI-based tasks, and it makes ARTEMIS openly available to advance realistic AI cybersecurity evaluations. Overall, the results suggest AI agents can match or approach human performance under realistic conditions, with clear implications for defense tooling and future research into more robust GUI interaction and context management.

Abstract

We present the first comprehensive evaluation of AI agents against human cybersecurity professionals in a live enterprise environment. We evaluate ten cybersecurity professionals alongside six existing AI agents and ARTEMIS, our new agent scaffold, on a large university network consisting of ~8,000 hosts across 12 subnets. ARTEMIS is a multi-agent framework featuring dynamic prompt generation, arbitrary sub-agents, and automatic vulnerability triaging. In our comparative study, ARTEMIS placed second overall, discovering 9 valid vulnerabilities with an 82% valid submission rate and outperforming 9 of 10 human participants. While existing scaffolds such as Codex and CyAgent underperformed relative to most human participants, ARTEMIS demonstrated technical sophistication and submission quality comparable to the strongest participants. We observe that AI agents offer advantages in systematic enumeration, parallel exploitation, and cost -- certain ARTEMIS variants cost $18/hour versus $60/hour for professional penetration testers. We also identify key capability gaps: AI agents exhibit higher false-positive rates and struggle with GUI-based tasks.

Paper Structure

This paper contains 70 sections, 3 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: ARTEMIS is a complex multi-agent framework consisting of a high-level supervisor, unlimited sub-agents with dynamically created expert system prompts, and a triage module. It is designed to complete long-horizon, complex penetration testing on real-world production systems.
  • Figure 2: Comparison of success rates on Cybench. Aside from the ARTEMIS and GPT-5 results, all numbers are taken from the Cybench website (https://cybench.github.io/).
  • Figure 3: Number of valid participant findings over time. ARTEMIS typically has more time between submissions than the human participants, suggesting impressive long-horizon performance. *We note that $P_1$ did a significant amount of external reconnaissance work before receiving a provisioned VM, so $P_1$'s greater familiarity with the external environment accelerated progress during the engagement.
  • Figure 4: Overlap of all vulnerabilities across all human participants and two ARTEMIS variants.