AttackSeqBench: Benchmarking Large Language Models in Analyzing Attack Sequences within Cyber Threat Intelligence

Haokai Ma, Javier Yong, Yunshan Ma, Kuei Chen, Anis Yusof, Zhenkai Liang, Ee-Chien Chang

TL;DR

AttackSeqBench introduces a CTI-focused benchmark to evaluate LLMs and related models on reasoning about attack sequences derived from real-world CTI reports using the MITRE ATT&CK framework. It formalizes attack sequences, builds a large QA dataset through an automated pipeline, and evaluates multiple models and post-training strategies under zero-shot, context, and RAG-enabled settings. The study finds that while LLM performance generally scales with model size, no single model dominates across tasks, added context boosts performance, and naive RAG can introduce noise, highlighting the need for domain-specific knowledge injection and better retrieval methods. Collectively, the work demonstrates the feasibility and limits of automated attack-sequence understanding in CTI, and provides actionable guidance for developing domain-aware, extensible, and scalable security reasoning systems.

Abstract

Cyber Threat Intelligence (CTI) reports document observations of cyber threats, synthesizing evidence about adversaries' actions and intent into actionable knowledge that informs detection, response, and defense planning. However, the unstructured and verbose nature of CTI reports makes it difficult for security practitioners to manually extract and analyze the attack sequences they describe. Although large language models (LLMs) show promise in cybersecurity tasks such as entity extraction and knowledge graph construction, their ability to understand and reason about behavioral sequences remains underexplored. To address this, we introduce AttackSeqBench, a benchmark designed to systematically evaluate LLMs' reasoning abilities across the tactical, technical, and procedural dimensions of adversarial behaviors, while satisfying Extensibility, Reasoning Scalability, and Domain-Specific Epistemic Expandability. We further benchmark 7 LLMs, 5 LRMs, and 4 post-training strategies across the 3 benchmark settings and 3 benchmark tasks of AttackSeqBench to identify their advantages and limitations in this domain. Our findings contribute to a deeper understanding of LLM-driven CTI report understanding and foster its application in cybersecurity operations.

Paper Structure

This paper contains 36 sections, 9 figures, 7 tables.

Figures (9)

  • Figure 1: Illustration of an example cyber attack sequence and our AttackSeqBench.
  • Figure 2: Overview of our automated QA dataset construction pipeline.
  • Figure 3: Overview of the three benchmark settings that exhibit varying levels of contextual information given to the LLM.
  • Figure 4: Parameter sensitivity analysis on (a) Temperature and (b) Max Output Tokens in AttackSeq-Tactic under the zero-shot setting.
  • Figure 5: Computational complexity analysis of seven LLMs and four LRMs in AttackSeq-Tactic under the regular setting. The size of each bubble represents inference time; zigzag lines denote LLMs and cross-hatch lines denote LRMs.
  • ...and 4 more figures