CTIArena: Benchmarking LLM Knowledge and Reasoning Across Heterogeneous Cyber Threat Intelligence

Yutong Cheng, Yang Liu, Changze Li, Dawn Song, Peng Gao

TL;DR

CTIArena presents the first broad benchmark for evaluating LLM knowledge and reasoning over heterogeneous, multi-source CTI with knowledge augmentation. It delineates nine tasks across three CTI categories (structured, unstructured, hybrid) and introduces a principled three-stage pipeline for dataset construction grounded in authoritative CTI sources. Experimental results show that closed-book LLMs struggle on CTI tasks, but domain-tailored retrieval strategies (e.g., CSKG-guided RAG and attack-behavior decomposition) yield substantial gains, especially for hybrid and unstructured tasks. The work also identifies failure modes such as semantic drift, retrieval misalignment, and instability in smaller models, highlighting the need for robust, domain-specific augmentation to unlock LLM potential in CTI copilots.

Abstract

Cyber threat intelligence (CTI) is central to modern cybersecurity, providing critical insights for detecting and mitigating evolving threats. With the natural language understanding and reasoning capabilities of large language models (LLMs), there is increasing interest in applying them to CTI, which calls for benchmarks that can rigorously evaluate their performance. Several early efforts have studied LLMs on some CTI tasks but remain limited: (i) they adopt only closed-book settings, relying on parametric knowledge without leveraging CTI knowledge bases; (ii) they cover only a narrow set of tasks, lacking a systematic view of the CTI landscape; and (iii) they restrict evaluation to single-source analysis, unlike realistic scenarios that require reasoning across multiple sources. To fill these gaps, we present CTIArena, the first benchmark for evaluating LLM performance on heterogeneous, multi-source CTI under knowledge-augmented settings. CTIArena spans three categories (structured, unstructured, and hybrid), which are further divided into nine tasks that capture the breadth of CTI analysis in modern security operations. We evaluate ten widely used LLMs and find that most struggle in closed-book setups but show noticeable gains when augmented with security-specific knowledge through the retrieval-augmented techniques we design. These findings highlight the limitations of general-purpose LLMs and the need for domain-tailored techniques to fully unlock their potential for CTI.

Paper Structure

This paper contains 53 sections, 1 figure, and 10 tables.

Figures (1)

  • Figure 1: Each CTI task in CTIArena is created through a three-stage construction process. Task quality is controlled through human-LLM collaboration.