CTI-HAL: A Human-Annotated Dataset for Cyber Threat Intelligence Analysis

Sofia Della Penna, Roberto Natella, Vittorio Orbinato, Lorenzo Parracino, Luciano Pianese

TL;DR

CTI-HAL addresses the challenge of extracting structured cyber threat intelligence from unstructured CTI sources by delivering a human-annotated, sentence-level dataset aligned to MITRE ATT&CK. Built from real CTI reports and annotated by two experts, its quality is supported by inter-annotator agreement analyses and cross-textual similarity metrics, covering 116 techniques across diverse APTs. The dataset is employed to evaluate a real-world automation flow using Claude 3 Haiku for TTP extraction, revealing size-dependent performance and strong generalizability to commercial CTI feeds. Overall, CTI-HAL enhances reproducibility, interpretability, and practical applicability of AI-driven CTI analysis for proactive defense.

Abstract

Organizations are increasingly targeted by Advanced Persistent Threats (APTs), which involve complex, multi-stage tactics and diverse techniques. Cyber Threat Intelligence (CTI) sources, such as incident reports and security blogs, provide valuable insights, but they are often unstructured and written in natural language, which makes automatic information extraction difficult. Recent studies have explored the use of AI to perform automatic extraction from CTI data, leveraging existing CTI datasets for performance evaluation and fine-tuning. However, these datasets present challenges and limitations that impact their effectiveness. To overcome these issues, we introduce a novel dataset manually constructed from CTI reports and structured according to the MITRE ATT&CK framework. To assess its quality, we conducted an inter-annotator agreement study using Krippendorff's alpha, confirming its reliability. Furthermore, the dataset was used to evaluate a Large Language Model (LLM) in a real-world business context, showing promising generalizability.
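
To illustrate the agreement measure named in the abstract, the following is a minimal sketch of Krippendorff's alpha for nominal labels, computed from a coincidence matrix as alpha = 1 - D_o / D_e (observed over expected disagreement). It is not the authors' implementation: the function name `krippendorff_alpha_nominal`, the input layout (one list of labels per annotated sentence), and the sample ATT&CK technique IDs are assumptions made only for illustration.

```python
from collections import Counter
from itertools import permutations


def krippendorff_alpha_nominal(units):
    """Krippendorff's alpha for nominal data.

    `units` is a list of lists: one inner list per annotated item
    (hypothetically, one per sentence), holding the label each annotator
    assigned. Use None for a missing annotation.
    """
    # Build the coincidence matrix: every ordered pair of labels from
    # different annotators in a unit contributes 1 / (m - 1).
    coincidences = Counter()
    for labels in units:
        values = [v for v in labels if v is not None]
        m = len(values)
        if m < 2:
            continue  # units with fewer than two labels are not pairable
        for a, b in permutations(values, 2):
            coincidences[(a, b)] += 1.0 / (m - 1)

    # Marginal totals n_c and the overall number of pairable values n.
    totals = Counter()
    for (a, _), weight in coincidences.items():
        totals[a] += weight
    n = sum(totals.values())
    if n <= 1:
        raise ValueError("need at least one unit with two or more labels")

    # Observed and expected disagreement (up to the common factor 1 / n).
    observed = sum(w for (a, b), w in coincidences.items() if a != b)
    expected = sum(
        totals[a] * totals[b] for a in totals for b in totals if a != b
    ) / (n - 1)
    if expected == 0:
        return 1.0  # only one label ever used: no disagreement possible
    return 1.0 - observed / expected


# Illustrative data only: two annotators labelling four sentences.
sample = [
    ["T1059", "T1059"],
    ["T1059", "T1566"],
    ["T1566", "T1566"],
    ["T1566", "T1566"],
]
alpha = krippendorff_alpha_nominal(sample)
```

An alpha of 1 indicates perfect agreement, while values near 0 indicate agreement no better than chance. The sketch handles only the nominal case with a single label per annotator per unit.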

Paper Structure

This paper contains 12 sections, 3 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Workflow
  • Figure 2: Example of the application of the workflow
  • Figure 3: Distribution of techniques and tools by number of occurrences.
  • Figure 4: APT29 - Krippendorff's $\alpha$
  • Figure 5: Evaluation of CTI extraction