Benchmarking LLM-Assisted Blue Teaming via Standardized Threat Hunting

Yuqiao Meng, Luoxi Tang, Feiyang Yu, Xi Li, Guanhua Yan, Ping Yang, Zhaohan Xi

TL;DR

CyberTeam tackles the challenge of evaluating LLMs in realistic blue-team threat hunting by imposing a dependency-driven task chain and a library of modular operations that mirror analyst workflows. By integrating 30 threat-hunting tasks with 9 operational modules and diverse, publicly available threat intelligence sources, CyberTeam enables rigorous, repeatable evaluation and comparison of LLMs and cybersecurity agents against open-ended prompting. The empirical results show substantial performance gains from standardized, modular reasoning, particularly for complex, multi-step tasks, and reveal limitations of open-ended approaches in real-world threat hunting. The benchmark and open-source code aim to accelerate the adoption of AI-assisted blue-team workflows and guide future research on reliable, interpretable cyber defense AI systems.

Abstract

As cyber threats continue to grow in scale and sophistication, blue team defenders increasingly require advanced tools to proactively detect and mitigate risks. Large Language Models (LLMs) offer promising capabilities for enhancing threat analysis. However, their effectiveness in real-world blue team threat-hunting scenarios remains insufficiently explored. This paper presents CyberTeam, a benchmark designed to guide LLMs in blue teaming practice. CyberTeam constructs a standardized workflow in two stages. First, it models realistic threat-hunting workflows by capturing the dependencies among analytical tasks from threat attribution to incident response. Next, each task is addressed through a set of operational modules tailored to its specific analytical requirements. This transforms threat hunting into a structured sequence of reasoning steps, with each step grounded in a discrete operation and ordered according to task-specific dependencies. Guided by this framework, LLMs are directed to perform threat-hunting tasks through modularized steps. Overall, CyberTeam integrates 30 tasks and 9 operational modules to guide LLMs through standardized threat analysis. We evaluate both leading LLMs and state-of-the-art cybersecurity agents, comparing CyberTeam against open-ended reasoning strategies. Our results highlight the improvements enabled by standardized design, while also revealing the limitations of open-ended reasoning in real-world threat hunting.
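
The two-stage design described above (a dependency chain of tasks, each resolved through a sequence of operational modules) can be illustrated with a small sketch. This is not the authors' implementation. The intermediate task names, the dependency edges, and the module sequence assigned to each task are hypothetical; only the endpoints "threat attribution" and "incident response" and the module abbreviations (NER, REX, MAP, RAG, CLS, SUM) are taken from the paper's text and figure captions.

```python
from graphlib import TopologicalSorter

# Hypothetical task graph: each task maps to the set of tasks it depends on.
# Only the first and last task names come from the abstract.
TASK_DEPENDENCIES = {
    "threat_attribution": set(),
    "behavior_analysis": {"threat_attribution"},
    "prioritization": {"behavior_analysis"},
    "incident_response": {"behavior_analysis", "prioritization"},
}

# Hypothetical module sequence per task (abbreviations from the Figure 1 caption).
TASK_MODULES = {
    "threat_attribution": ["NER", "RAG", "CLS"],
    "behavior_analysis": ["REX", "MAP"],
    "prioritization": ["CLS"],
    "incident_response": ["RAG", "SUM"],
}


def build_plan(dependencies, modules):
    """Return (task, module) steps so that every task follows its dependencies."""
    task_order = TopologicalSorter(dependencies).static_order()
    return [(task, module) for task in task_order for module in modules[task]]


plan = build_plan(TASK_DEPENDENCIES, TASK_MODULES)
for task, module in plan:
    print(f"{task}: {module}")
```

In a full pipeline each `(task, module)` step would presumably become one prompt to the LLM, with the outputs of earlier tasks passed forward as context. The sketch covers only the ordering logic.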

Paper Structure

This paper contains 33 sections, 6 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: A CyberTeam threat hunting example equipped with operational modules. Module names: NER--named entity recognition, REX--regex parsing, MAP--text mapping, RAG--retrieval-augmented generation, CLS--classification, SUM--summarization.
  • Figure 2: A threat hunting example demonstrating a dependency chain of analytical tasks, where each task is completed through a sequence of operational modules executed by LLMs autonomously.
  • Figure 3: Threat-hunting performance (scaled to 100%) on individual tasks, evaluated with GPT-o4-mini. Results for additional LLMs are provided in Appendix \ref{app:expt}.
  • Figure 4: LLM performance (metrics corresponding to Table \ref{tab:benchmark}) when input threat logs are perturbed with token-level noise (solid line) or semantic-level noise (dashed line). The x-axis shows the noise ratio.
  • Figure 5: Threat-hunting performance on individual tasks, evaluated with GPT-4o.
  • ...and 3 more figures