Guided Reasoning in LLM-Driven Penetration Testing Using Structured Attack Trees

Katsuaki Nakano, Reza Fayyazi, Shanchieh Jay Yang, Michael Zuzak

TL;DR

The paper tackles the instability of self-guided LLMs in automated penetration testing by introducing a Structured Task Tree (STT) that constrains reasoning to well-defined MITRE ATT&CK-based tactics and techniques. The proposed deterministic reasoning pipeline enables external task tracking and explicit progress, reducing hallucinations and improving efficiency across three diverse LLMs (Llama-3-8B, Gemini-1.5, GPT-4) on 10 HackTheBox machines. Across 103 subtasks, the STT approach achieves substantially higher subtask completion and up to ~56% fewer LLM queries than a state-of-the-art baseline, with smaller models benefiting most. The work provides a transferable framework for grounded, auditable, and efficient LLM-based penetration testing and outlines clear avenues for enhancement via CVE retrieval and multimodal tool interaction to handle more complex exploits.

Abstract

Recent advances in Large Language Models (LLMs) have driven interest in automating cybersecurity penetration testing workflows, offering the promise of faster and more consistent vulnerability assessment for enterprise systems. Existing LLM agents for penetration testing primarily rely on self-guided reasoning, which can produce inaccurate or hallucinated procedural steps. As a result, the LLM agent may undertake unproductive actions, such as exploiting unused software libraries or generating cyclical responses that repeat prior tactics. In this work, we propose a guided reasoning pipeline for penetration testing LLM agents that incorporates a deterministic task tree built from the MITRE ATT&CK Matrix, a proven penetration testing kill chain, to constrain the LLM's reasoning process to explicitly defined tactics, techniques, and procedures. This anchors reasoning in proven penetration testing methodologies and filters out ineffective actions by guiding the agent towards more productive attack procedures. To evaluate our approach, we built an automated penetration testing LLM agent using three LLMs (Llama-3-8B, Gemini-1.5, and GPT-4) and applied it to navigate 10 HackTheBox cybersecurity exercises with 103 discrete subtasks representing real-world cyberattack scenarios. Our proposed reasoning pipeline guided the LLM agent through 71.8%, 72.8%, and 78.6% of subtasks using Llama-3-8B, Gemini-1.5, and GPT-4, respectively. Comparatively, the state-of-the-art LLM penetration testing tool using self-guided reasoning completed only 13.5%, 16.5%, and 75.7% of subtasks and required 86.2%, 118.7%, and 205.9% more model queries. This suggests that incorporating a deterministic task tree into LLM reasoning pipelines can enhance the accuracy and efficiency of automated cybersecurity assessments.
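
To make the idea of a deterministic task tree with external progress tracking concrete, the following is a minimal Python sketch and not the authors' implementation. The names `Node`, `Status`, `build_tree`, `next_task`, and `run` are hypothetical, the two tactics and four techniques are a small sample of MITRE ATT&CK entries chosen for illustration, and the `attempt` callback is an assumed stand-in for the step where an LLM and its tools try a technique.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Optional


class Status(Enum):
    TODO = "todo"
    DONE = "done"
    FAILED = "failed"


@dataclass
class Node:
    """One tree entry: the root, a tactic, or a technique (hypothetical layout)."""
    name: str
    status: Status = Status.TODO
    children: list["Node"] = field(default_factory=list)


def build_tree() -> Node:
    """Tiny illustrative tree: root -> tactics -> techniques."""
    return Node("root", children=[
        Node("Reconnaissance", children=[
            Node("Active Scanning"),
            Node("Gather Victim Host Information"),
        ]),
        Node("Initial Access", children=[
            Node("Exploit Public-Facing Application"),
            Node("Valid Accounts"),
        ]),
    ])


def next_task(node: Node) -> Optional[Node]:
    """Depth-first search for the first leaf still marked TODO."""
    if not node.children:
        return node if node.status is Status.TODO else None
    for child in node.children:
        found = next_task(child)
        if found is not None:
            return found
    return None


def run(tree: Node, attempt: Callable[[str], bool]) -> list[str]:
    """Visit each technique once; progress lives in the tree, not in the model.

    `attempt(name) -> bool` is an assumed placeholder for the LLM + tool step.
    """
    visited = []
    while (task := next_task(tree)) is not None:
        visited.append(task.name)
        task.status = Status.DONE if attempt(task.name) else Status.FAILED
    return visited


if __name__ == "__main__":
    tree = build_tree()
    # Pretend every technique succeeds except "Valid Accounts".
    for name in run(tree, lambda n: n != "Valid Accounts"):
        print(name)
```

Because the status of every technique is stored in the tree rather than in the model's context, a technique that has been tried is never selected again, which is one simple way to rule out the cyclical responses described above. The selection rule here (first untried leaf in depth-first order) is an assumption made for brevity.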

Paper Structure

This paper contains 17 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: Pipeline of the proposed STT-based reasoning method (red labels 1–4) and the prior self-guided reasoning method (blue labels A–C) used by PentestGPT. Solid arrows indicate a transition between processes, and dotted arrows represent an interaction between the LLM and the STT.
  • Figure 2: The task selection flow of the proposed STT-based reasoning pipeline.