Towards Automated Penetration Testing: Introducing LLM Benchmark, Analysis, and Improvements

Isamu Isozaki, Manil Shrestha, Rick Console, Edward Kim

TL;DR

The paper responds to the growing threat of cybercrime by proposing an open benchmark for evaluating LLM-based automated penetration testing. It assesses two leading LLMs (GPT-4o and Llama3.1-405B) within the PentestGPT framework on Vulnhub-based tasks, and finds that end-to-end automation has not yet been achieved while identifying concrete areas for improvement. Through three cumulative ablations (Inject Summary, Structured Generation, and Retrieval Augmented Generation), it demonstrates that augmenting context and task management can yield measurable gains, with RAG offering the strongest overall benefit. The work provides a standardized evaluation protocol and actionable insights for advancing AI-assisted cybersecurity, and it highlights risks and limitations that motivate future research on reinforcement learning and self-play.

Abstract

Hacking poses a significant threat to cybersecurity, inflicting billions of dollars in damages annually. To mitigate these risks, ethical hacking, or penetration testing, is employed to identify vulnerabilities in systems and networks. Recent advancements in large language models (LLMs) have shown potential across various domains, including cybersecurity. However, there is currently no comprehensive, open, automated, end-to-end penetration testing benchmark to drive progress and evaluate the capabilities of these models in security contexts. This paper introduces a novel open benchmark for LLM-based automated penetration testing, addressing this critical gap. We first evaluate the performance of LLMs, including GPT-4o and Llama 3.1-405B, using the state-of-the-art PentestGPT tool. Our findings reveal that while Llama 3.1 demonstrates an edge over GPT-4o, both models currently fall short of performing end-to-end penetration testing, even with minimal human assistance. Next, we advance the state of the art and present ablation studies that provide insights into improving the PentestGPT tool. Our research illuminates the challenges LLMs face in each aspect of penetration testing, e.g., enumeration, exploitation, and privilege escalation. This work contributes to the growing body of knowledge on AI-assisted cybersecurity and lays the foundation for future research in automated penetration testing using large language models.

Paper Structure

This paper contains 19 sections, 13 figures, 4 tables.

Figures (13)

  • Figure 1: Current LLM-Based Pentest Benchmarking Process: LLM-suggested actions are executed by human operators. Feedback is provided to the LLM-powered Pentest tool, and human operators assess the tool's performance throughout the process. The future goal is to minimize human interaction in this process.
  • Figure 2: Task Distribution Across Penetration Testing Categories: Illustrates the distribution of tasks across four key categories in penetration testing: Reconnaissance, Exploitation, Privilege Escalation, and General Techniques.
  • Figure 3: Categories Density Through Task Sequence: The figure shows how reconnaissance tasks dominate the early stages of penetration testing, while exploitation and privilege escalation are more frequent toward the end.
  • Figure 4: This chart displays the performance of the PentestGPT tool benchmark using two popular LLMs: GPT-4o and Llama3.1-405B. Llama3.1 outperforms GPT-4o on 7 machines, both models show equal performance on 4 machines, and GPT-4o performs better on 2 machines.
  • Figure 5: The base PentestGPT configuration and the three cumulative ablations performed in this study: (a) Base PentestGPT. (b) Ablation 1, Inject Summary: we maintain the summary and create a summary of past summaries to preserve knowledge of the progress made and maintain history. (c) Ablation 2, Structured Generation: we update the reasoning module to maintain a structured to-do list instead of an unstructured Penetration Testing Tree (PTT). Ablation 2 includes the changes from Ablation 1. (d) Ablation 3, RAG Context: building on Ablations 1 and 2, we add RAG context based on data scraped from HackTricks [hacktricks_readme]. RAG retrieves similar chunks from the vector database and adds them to the context of the reasoning module.
  • ...and 8 more figures
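
The cumulative ablations described in the Figure 5 caption (a running summary, a structured to-do list, and retrieved chunks added to the reasoning module's context) can be illustrated with a minimal sketch. This is not the authors' implementation: the bag-of-words "embedding" stands in for a real embedding model and vector database, the function names (`embed`, `cosine`, `retrieve`, `build_reasoning_context`) are hypothetical, and the sample chunks are invented placeholders rather than text scraped from HackTricks.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: lowercase bag-of-words counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def retrieve(query: str, chunks: list[str], k: int = 2) -> list[str]:
    """Return the k chunks most similar to the query, best first."""
    q = embed(query)
    ranked = sorted(chunks, key=lambda c: cosine(q, embed(c)), reverse=True)
    return ranked[:k]


def build_reasoning_context(summary: str, todo: list[dict], retrieved: list[str]) -> str:
    """Assemble a prompt context from the summary, to-do list, and retrieved chunks."""
    todo_lines = [f"- [{item['status']}] {item['task']}" for item in todo]
    parts = [
        "Summary of progress so far:",
        summary,
        "To-do list:",
        *todo_lines,
        "Retrieved reference notes:",
        *retrieved,
    ]
    return "\n".join(parts)


# Invented placeholder chunks (assumption: real chunks would come from a vector database).
chunks = [
    "Enumerate SMB shares with smbclient and check for anonymous access.",
    "Check sudo -l output for binaries that allow privilege escalation.",
    "Scan open ports with nmap to enumerate services.",
]
todo = [
    {"task": "Enumerate open services", "status": "done"},
    {"task": "Attempt privilege escalation via sudo", "status": "todo"},
]
query = "sudo privilege escalation on the target"
context = build_reasoning_context(
    summary="Obtained a low-privilege shell on the target.",
    todo=todo,
    retrieved=retrieve(query, chunks, k=1),
)
print(context)
```

In a real pipeline the term-count vectors would be replaced by dense embeddings and the linear scan by a vector-database query; the sketch only shows how the three pieces of context fit together.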