AutoPenBench: Benchmarking Generative Agents for Penetration Testing

Luca Gioacchini, Marco Mellia, Idilio Drago, Alexander Delsanto, Giuseppe Siracusano, Roberto Bifulco

TL;DR

AutoPenBench addresses the lack of a standardized, open benchmark for evaluating generative agents in automated penetration testing. It introduces 33 Docker-based, CTF-style tasks spanning in-vitro and real-world CVE scenarios, tracks progress through milestones, and evaluates two CoALA-inspired agent architectures (autonomous and assisted). The study shows substantial gains from human-in-the-loop collaboration (SR improving from 21% to 64%) and reveals critical limitations of fully autonomous operation, especially in exploiting complex vulnerabilities. The work lays a foundation for reproducible evaluation and for future extension to broader tasks, other LLMs, and retrieval-augmented approaches.

Abstract

Generative AI agents, software systems powered by Large Language Models (LLMs), are emerging as a promising approach to automate cybersecurity tasks. Among these tasks, penetration testing is a challenging field due to the task complexity and the diverse strategies needed to simulate cyber-attacks. Despite growing interest and initial studies in automating penetration testing with generative agents, there remains a significant gap: no comprehensive and standard framework exists for their evaluation and development. This paper introduces AutoPenBench, an open benchmark for evaluating generative agents in automated penetration testing. We present a comprehensive framework that includes 33 tasks, each representing a vulnerable system that the agent has to attack. Tasks are of increasing difficulty levels, including in-vitro and real-world scenarios. We assess the agent performance with generic and specific milestones that allow us to compare results in a standardised manner and understand the limits of the agent under test. We show the benefits of AutoPenBench by testing two agent architectures: a fully autonomous one and a semi-autonomous one that supports human interaction. We compare their performance and limitations. For example, the fully autonomous agent performs unsatisfactorily, achieving a 21% Success Rate (SR) across the benchmark, solving 27% of the simple tasks and only one real-world task. In contrast, the assisted agent demonstrates substantial improvements, with a 64% SR. AutoPenBench also allows us to observe how different LLMs like GPT-4o or OpenAI o1 impact the ability of the agents to complete the tasks. We believe that our benchmark fills the gap with a standard and flexible framework to compare penetration testing agents on common ground. We hope to extend AutoPenBench along with the research community by making it available at https://github.com/lucagioacchini/auto-pen-bench.
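
The abstract reports a Success Rate (SR) and milestone-based progress. As a rough illustration of how such metrics could be computed from per-run records, here is a minimal Python sketch. It is not taken from the AutoPenBench code base: the `RunResult` record, its field names, the two helper functions, and the sample data are all hypothetical and chosen only for this example.

```python
from dataclasses import dataclass


@dataclass
class RunResult:
    """Outcome of one agent run on one task (hypothetical record layout)."""
    task_id: str
    milestones_total: int    # milestones defined for the task
    milestones_reached: int  # milestones the agent reached in this run
    flag_captured: bool      # True if the agent completed the task


def success_rate(runs):
    """Fraction of runs in which the task was completed."""
    if not runs:
        return 0.0
    return sum(r.flag_captured for r in runs) / len(runs)


def mean_milestone_progress(runs):
    """Average fraction of milestones reached per run."""
    if not runs:
        return 0.0
    fractions = [r.milestones_reached / r.milestones_total for r in runs]
    return sum(fractions) / len(fractions)


# Made-up data, for illustration only (not results from the paper).
runs = [
    RunResult("task_a", 4, 4, True),
    RunResult("task_b", 5, 2, False),
    RunResult("task_c", 8, 4, False),
    RunResult("task_d", 4, 1, False),
]

print(f"SR = {success_rate(runs):.0%}")
print(f"mean milestone progress = {mean_milestone_progress(runs):.0%}")
```

Reporting milestone progress next to SR separates an agent that fails early from one that nearly completes a task, a difference a binary success count hides.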

Paper Structure (23 sections, 5 figures, 5 tables, 2 algorithms)

Figures (5)

  • Figure 1: Overview of the penetration test infrastructure.
  • Figure 2: Example of commands executed by our autonomous agent when accomplishing a real-world pentest task involving the exploitation of CVE-2024-36401 (the CVE$_0$ task). Each command corresponds to a reached command milestone, whereas different colours indicate different stage milestones.
  • Figure 3: Overview of the agent procedures executed in a single execution step for autonomous and assisted agents. Reasoning procedures are in light grey.
  • Figure 4: Success Rate of each pentest stage for real-world tasks (CVE). The right y-axis reports the SR relative to the previous stage.
  • Figure 5: Distributions of steps at which the agent achieves each pentest stage over 10 runs of the same task.