CyberSleuth: Autonomous Blue-Team LLM Agent for Web Attack Forensics

Stefano Fumero, Kai Huang, Matteo Boffa, Danilo Giordano, Marco Mellia, Dario Rossi

TL;DR

The findings show that (i) multi-agent specialisation is key to sustained reasoning; (ii) simple orchestration outperforms nested hierarchical architectures; and (iii) the CyberSleuth design generalises across different forensic tasks.

Abstract

Post-mortem analysis of compromised systems is a key aspect of cyber forensics, today a mostly manual, slow, and error-prone task. Agentic AI, i.e., LLM-powered agents, is a promising avenue for automation. However, applying such agents to cybersecurity remains largely unexplored and difficult, as this domain demands long-term reasoning, contextual memory, and consistent evidence correlation - capabilities that current LLM agents struggle to master. In this paper, we present the first systematic study of LLM agents to automate post-mortem investigation. As a first scenario, we consider realistic attacks in which remote attackers try to abuse online services using well-known CVEs (30 controlled cases). The agent receives as input the network traces of the attack and extracts forensic evidence. We compare three AI agent architectures, six LLM backends, and assess their ability to i) identify compromised services, ii) map exploits to exact CVEs, and iii) prepare thorough reports. Our best-performing system, CyberSleuth, achieves 80% accuracy on 2025 incidents, producing complete, coherent, and practically useful reports (judged by a panel of 25 experts). We next illustrate how readily CyberSleuth adapts to face the analysis of infected machine traffic, showing that the effective AI agent design can transfer across forensic tasks. Our findings show that (i) multi-agent specialisation is key to sustained reasoning; (ii) simple orchestration outperforms nested hierarchical architectures; and (iii) the CyberSleuth design generalises across different forensic tasks.

Paper Structure

This paper contains 27 sections, 1 equation, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Overview of agent architectures. Each architecture receives the traffic trace and processes it through a pipeline of pre-processing, reporting, and reasoning. The agents connect to the desired LLM backend via API, interact with tools and sub-agents (tshark, Flow Summariser, and Web Search) to extract evidence and produce the forensic report.
  • Figure 2: Breakdown of correct identifications (top: service, bottom: CVE) over 20 incidents. FRA outperforms others, both in simpler (left) and harder (right) samples.
  • Figure 3: Web-search outcomes across incidents, with red cells marking failed CVE identifications. FRA issues more accurate web queries, succeeding where SA and TEA fail.
  • Figure 4: Breakdown of web-search behaviour and CVE identification outcomes across 60 runs. Models differ sharply in how they balance web use and reasoning for CVE detection. The summed detection rates match Table \ref{tab:llm_ablation}.
  • Figure 5: Average scores for Completeness, Usefulness, and Logical Coherence by incident and expertise level. Both o3- and DeepSeek-based agents are rated highly across all evaluation criteria. Whiskers show min–max range.
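
The pipeline described in the Figure 1 caption (an agent that takes a traffic trace, calls tools such as tshark, the Flow Summariser, and Web Search, and then writes a report) can be illustrated with a minimal sketch. This is not the authors' implementation: the tool functions are stubs that return placeholder strings, `scripted_policy` stands in for the LLM backend, and every name below (`investigate`, `render_report`, `TOOLS`, the trace path) is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple

# Hypothetical tool stubs. A real agent would run tshark on the capture,
# summarise flows, or query a search engine; here each returns a placeholder.
def run_tshark(query: str) -> str:
    return f"[stub tshark output for filter: {query}]"

def summarise_flows(query: str) -> str:
    return f"[stub flow summary for: {query}]"

def web_search(query: str) -> str:
    return f"[stub search results for: {query}]"

TOOLS: Dict[str, Callable[[str], str]] = {
    "tshark": run_tshark,
    "flow_summariser": summarise_flows,
    "web_search": web_search,
}

Action = Tuple[str, str]              # (tool name, query)
Evidence = Tuple[str, str, str]       # (tool name, query, tool output)

@dataclass
class Investigation:
    trace_path: str
    evidence: List[Evidence] = field(default_factory=list)

def scripted_policy(step: int, evidence: List[Evidence]) -> Optional[Action]:
    """Stand-in for the LLM backend: pick the next tool call, or None to stop."""
    plan: List[Action] = [
        ("flow_summariser", "all flows"),
        ("tshark", "http.request"),
        ("web_search", "suspicious request path CVE"),
    ]
    return plan[step] if step < len(plan) else None

def investigate(trace_path: str,
                policy: Callable[[int, List[Evidence]], Optional[Action]],
                max_steps: int = 10) -> Investigation:
    """Simple flat orchestration loop: ask the policy, run the tool, store evidence."""
    inv = Investigation(trace_path)
    for step in range(max_steps):
        action = policy(step, inv.evidence)
        if action is None:
            break
        tool, query = action
        inv.evidence.append((tool, query, TOOLS[tool](query)))
    return inv

def render_report(inv: Investigation) -> str:
    """Turn the collected evidence into a plain-text report."""
    lines = [f"Forensic report for {inv.trace_path}"]
    lines += [f"- {tool}({query!r}): {out}" for tool, query, out in inv.evidence]
    return "\n".join(lines)

if __name__ == "__main__":
    print(render_report(investigate("attack.pcap", scripted_policy)))
```

The loop is deliberately flat, with one controller calling tools directly, to mirror the "simple orchestration" idea in the abstract; how CyberSleuth actually splits work between agents and sub-agents is not specified in this excerpt.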