Incalmo: An Autonomous LLM-assisted System for Red Teaming Multi-Host Networks

Brian Singer, Keane Lucas, Lakshmi Adiga, Meghna Jain, Lujo Bauer, Vyas Sekar

TL;DR

This work addresses the challenge of autonomously executing red-team attacks across multi-host networks, where prior LLM-based systems struggle because they operate on low-level commands and suffer from context bloat. Incalmo proposes a two-layer architecture that decouples planning from execution: an LLM plans in terms of high-level declarative tasks, domain-specific agents execute those tasks, and auxiliary services (environment state, attack graph, and C&C) manage knowledge and assets. On MHBench, a 40-environment multi-host benchmark, Incalmo succeeds in 37 of 40 environments, with successful attacks taking 12-54 minutes and costing under $15 in LLM credits, whereas the baselines succeed in only 3 of 40. The results highlight the importance of abstraction, modular agents, and robust context management for scalable, autonomous red teaming, and the work provides open-source tools to advance reproducibility and further research.

Abstract

Security operators use red teams to simulate real attackers and proactively find defense gaps. In realistic enterprise settings, this involves executing multi-host network attacks spanning many "stepping stone" hosts. Unfortunately, red teams are expensive and entail significant expertise and effort. Given the promise of LLMs in CTF challenges, we first analyze whether LLMs can autonomously execute multi-host red team exercises. We find that state-of-the-art LLM-assisted offense systems (e.g., PentestGPT, CyberSecEval3) with leading LLMs (e.g., Sonnet 4, Gemini 2.5 Pro) are unable to do so. Building on our observations of the failure modes of state-of-the-art systems, we argue for improving the abstractions and interfaces for LLM-assisted red teaming. Based on this insight, we present the design and implementation of Incalmo, an LLM-assisted system for autonomously red teaming multi-host networks. Incalmo uses LLMs to plan red team exercises in terms of high-level declarative tasks that are executed by domain-specific task agents. Incalmo also uses auxiliary services to manage context and acquired assets. For our evaluation, we develop MHBench, a novel multi-host attack benchmark with 40 realistic emulated networks (from 22 to 50 hosts). We find that Incalmo successfully acquires critical assets (i.e., key hosts or data) in 37 out of 40 MHBench environments. In contrast, state-of-the-art LLM-assisted systems succeed in only 3 out of 40 environments. We show that Incalmo is efficient: successful attacks took 12-54 minutes and cost <$15 in LLM credits.
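The decoupling of planning from execution described in the abstract can be illustrated with a toy, fully in-memory simulation. This is only a sketch under stated assumptions, not Incalmo's implementation: the names `EnvironmentState`, `Task`, `TOPOLOGY`, `AGENTS`, `plan_next`, and `run` are hypothetical, the three-host topology is made up, and a simple rule-based function stands in for the LLM planner. No real network operations are performed.

```python
from dataclasses import dataclass, field


@dataclass
class EnvironmentState:
    """Hypothetical stand-in for a service that tracks acquired knowledge."""
    known_hosts: set = field(default_factory=set)
    controlled_hosts: set = field(default_factory=set)
    scanned_hosts: set = field(default_factory=set)


@dataclass(frozen=True)
class Task:
    """A high-level declarative task: what to achieve, not how to do it."""
    name: str
    target: str


# Made-up toy topology: hosts that become visible from each host.
TOPOLOGY = {"entry": ["web"], "web": ["db"], "db": []}


def scan_agent(task, state):
    """Toy task agent: records the hosts visible from the target host."""
    state.known_hosts.update(TOPOLOGY[task.target])
    state.scanned_hosts.add(task.target)


def move_agent(task, state):
    """Toy task agent: marks an already-known host as controlled."""
    if task.target in state.known_hosts:
        state.controlled_hosts.add(task.target)


# Execution layer: maps task names to the agents that carry them out.
AGENTS = {"scan": scan_agent, "move": move_agent}


def plan_next(state, goal):
    """Rule-based placeholder for the planner (an LLM in the paper)."""
    if goal in state.controlled_hosts:
        return None  # goal reached, nothing left to plan
    # First, scan from any controlled host that has not been scanned yet.
    for host in sorted(state.controlled_hosts - state.scanned_hosts):
        return Task("scan", host)
    # Otherwise, take control of a host that is known but not controlled.
    for host in sorted(state.known_hosts - state.controlled_hosts):
        return Task("move", host)
    return None  # no further progress is possible


def run(goal, max_steps=20):
    """Loop: decide a task, execute it via an agent, update the state."""
    state = EnvironmentState(known_hosts={"entry"}, controlled_hosts={"entry"})
    history = []
    for _ in range(max_steps):
        task = plan_next(state, goal)
        if task is None:
            break
        AGENTS[task.name](task, state)  # the agent hides low-level details
        history.append(task)
    return state, history


if __name__ == "__main__":
    final_state, tasks = run("db")
    for t in tasks:
        print(t.name, t.target)
```

In this sketch the planner only ever sees the summarized state and emits task names, while the agents own the details of carrying a task out; that separation is the point being illustrated, and everything else is simplified for brevity.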

Paper Structure

This paper contains 24 sections, 15 figures, 4 tables.

Figures (15)

  • Figure 1: Incalmo is a system for executing multi-host red teams. Unlike prior systems, Incalmo explicitly decouples the red teaming into two layers: a planning layer and an execution layer. Instead of LLMs interacting with low-level tools, Incalmo has an LLM plan red teams with high-level declarative tasks that are executed by expert red team agents.
  • Figure 2: Comparing Reliability across environments between Incalmo and ExpertPromptShell with Sonnet 4, the LLM-based system with the best performance on MHBench. Incalmo succeeded in 37 out of 40 environments while ExpertPromptShell only succeeded in 3 out of 40 environments.
  • Figure 3: The attacker in the 2017 Equifax breach executed a multi-host attack, exploiting hosts, infecting them, and exfiltrating data from multiple hosts across two networks.
  • Figure 4: The Success and TotalAcquisition metrics of LLM offense systems across 10 environments. The systems were largely unable to realize multi-host attacks.
  • Figure 5: A mental model of how red teams execute multi-host attacks. Red teams start with a goal (e.g., exfiltrate important data). Then, they follow a loop of deciding a task (e.g., infect a server), executing the task (e.g., launching an exploit), and updating their knowledge/capabilities.
  • ...and 10 more figures