From Capabilities to Performance: Evaluating Key Functional Properties of LLM Architectures in Penetration Testing

Lanxiao Huang, Daksh Dave, Tyler Cody, Peter Beling, Ming Jin

TL;DR

This work systematically evaluates multiple LLM-driven penetration testing (PT) agents and identifies five core functional capabilities that determine performance: context coherence, inter-agent coordination, tool-use accuracy, multi-step planning, and real-time responsiveness. It introduces five architectural augmentations (GCM, IAM, CCI, AP, RTM) inspired by Multi-Agent System principles and demonstrates that they strengthen modular LLM agents across end-to-end PT tasks mapped to MITRE ATT&CK. Across a diverse testbed of vulnerable machines, the augmentations substantially improve reliability and task completion, though real-time tasks such as man-in-the-middle (MITM) attacks remain challenging for current models. The findings offer a concrete blueprint for designing robust, autonomous offensive security systems by embedding long-horizon memory, structured inter-component grounding, and runtime-event awareness, with implications for safer dual-use deployment and future antifragile AI safety research.

Abstract

Large language models (LLMs) are increasingly used to automate or augment penetration testing, but their effectiveness and reliability across attack phases remain unclear. We present a comprehensive evaluation of multiple LLM-based agents, from single-agent to modular designs, across realistic penetration testing scenarios, measuring empirical performance and recurring failure patterns. We also isolate the impact of five core functional capabilities via targeted augmentations: Global Context Memory (GCM), Inter-Agent Messaging (IAM), Context-Conditioned Invocation (CCI), Adaptive Planning (AP), and Real-Time Monitoring (RTM). These interventions support, respectively: (i) context coherence and retention, (ii) inter-component coordination and state management, (iii) tool use accuracy and selective execution, (iv) multi-step strategic planning, error detection, and recovery, and (v) real-time dynamic responsiveness. Our results show that while some architectures natively exhibit subsets of these properties, targeted augmentations substantially improve modular agent performance, especially in complex, multi-step, and real-time penetration testing tasks.
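The "respectively" pairing in the abstract is easy to misread, so the sketch below restates it as a lookup table. The augmentation names and capability descriptions are taken from the abstract; the dictionary layout and the `describe` helper are illustrative assumptions, not part of the paper.

```python
# Pairing of each augmentation with the capability it supports,
# copied from the abstract. The structure itself is hypothetical.
AUGMENTATIONS = {
    "GCM": ("Global Context Memory", "context coherence and retention"),
    "IAM": ("Inter-Agent Messaging",
            "inter-component coordination and state management"),
    "CCI": ("Context-Conditioned Invocation",
            "tool use accuracy and selective execution"),
    "AP": ("Adaptive Planning",
           "multi-step strategic planning, error detection, and recovery"),
    "RTM": ("Real-Time Monitoring", "real-time dynamic responsiveness"),
}


def describe(acronym: str) -> str:
    """Return a one-line description of an augmentation (hypothetical helper)."""
    name, capability = AUGMENTATIONS[acronym]
    return f"{name} ({acronym}) supports {capability}"


if __name__ == "__main__":
    for key in AUGMENTATIONS:
        print(describe(key))
```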

Paper Structure

This paper contains 84 sections, 3 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: The evaluation workflow of LLM-Guided Penetration Testing: Ethical hackers utilize web searches and cybersecurity expertise, structured through prompt templates, to define penetration testing objectives for LLMs. The LLMs generate the next action, which either executes an external PT tool or issues a direct command to interact with the testing environment. The resulting tool or terminal feedback is then analyzed by the LLMs to determine subsequent steps, ensuring an iterative and adaptive testing process.
  • Figure 2: LLM Performance Drop-Off Across Penetration Testing Task Complexity: the average performance trend (indicated by success rate) of models across penetration testing tasks, arranged in increasing complexity.
  • Figure 3: Distribution of Failure Modes Across LLMs in Penetration Testing: the percentage (FR) of different failure modes encountered across various LLMs during penetration testing tasks.
  • Figure 4: Context Retention Timeline: Cumulative Errors Over Steps. The X-axis represents the interaction steps (phases in the exploitation workflow), and the Y-axis shows the cumulative count of context errors.
  • Figure 5: Risk-Task Matrix with Recommended Human Oversight. Tasks are ordered from least to most complex (bottom to top), with risk levels (Low, Medium, High) categorized along the columns. The intervention score (numerical values) represents the degree of human oversight needed, with higher values indicating greater human involvement.
  • ...and 2 more figures
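
The iterative process described in the Figure 1 caption (propose an action, execute it, feed the result back) can be sketched as a simple control loop. This is a minimal illustration under stated assumptions: `Session`, `run_loop`, `scripted_policy`, and `fake_execute` are hypothetical names, the policy is a scripted stand-in for an LLM, and the executor returns canned text instead of running any tool. It is not the paper's implementation.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Tuple


@dataclass
class Session:
    """Objective plus the (action, feedback) pairs seen so far (hypothetical)."""
    objective: str
    history: List[Tuple[str, str]] = field(default_factory=list)


def run_loop(
    objective: str,
    choose_action: Callable[[Session], Optional[str]],
    execute: Callable[[str], str],
    max_steps: int = 10,
) -> Session:
    """Alternate between choosing an action and recording its feedback."""
    session = Session(objective)
    for _ in range(max_steps):
        action = choose_action(session)   # the LLM's role in Figure 1
        if action is None:                # the policy signals it is done
            break
        feedback = execute(action)        # tool or terminal output
        session.history.append((action, feedback))
    return session


def scripted_policy(session: Session) -> Optional[str]:
    """Stand-in for an LLM: walks a fixed list of abstract step names."""
    plan = ["scan", "enumerate", "report"]
    step = len(session.history)
    return plan[step] if step < len(plan) else None


def fake_execute(action: str) -> str:
    """Stand-in for the test environment: returns canned feedback."""
    return f"output of {action}"


if __name__ == "__main__":
    result = run_loop("demo objective", scripted_policy, fake_execute)
    for action, feedback in result.history:
        print(action, "->", feedback)
```

Passing the policy and executor in as callables keeps the loop independent of any particular model or tool. The `max_steps` bound is an added safeguard against a policy that never terminates and is likewise an assumption.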