AnnoCTR: A Dataset for Detecting and Linking Entities, Tactics, and Techniques in Cyber Threat Reports

Lukas Lange, Marc Müller, Ghazaleh Haratinezhad Torbati, Dragan Milchevski, Patrick Grau, Subhash Pujari, Annemarie Friedrich

TL;DR

AnnoCTR addresses the lack of open, richly annotated cyber threat data with a CC-BY-SA-licensed dataset of 400 cyber threat reports, 120 of which are mapped to MITRE ATT&CK concepts and Wikipedia entities. The work defines four NLP tasks (NER, temporal tagging, entity disambiguation, and sentence-level tactic/technique classification) and benchmarks transformer models, domain-adapted temporal tagging, and knowledge-base linking approaches (BLINK and GENRE). The reported findings are that RoBERTa-based NER performs well on both general and cybersecurity-specific entities, and that domain-specific fine-tuning and data augmentation with MITRE ATT&CK descriptions improve the disambiguation and classification of techniques and tactics. The dataset and baselines support open research in cybersecurity NLP and cyber threat intelligence (CTI) tooling, for example threat intelligence analysis and search based on MITRE ATT&CK mappings and knowledge-base linking.

Abstract

Monitoring the threat landscape to be aware of actual or potential attacks is of utmost importance to cybersecurity professionals. Information about cyber threats is typically distributed using natural language reports. Natural language processing can help with managing this large amount of unstructured information, yet to date, the topic has received little attention. With this paper, we present AnnoCTR, a new CC-BY-SA-licensed dataset of cyber threat reports. The reports have been annotated by a domain expert with named entities, temporal expressions, and cybersecurity-specific concepts including implicitly mentioned techniques and tactics. Entities and concepts are linked to Wikipedia and the MITRE ATT&CK knowledge base, the most widely used taxonomy for classifying types of attacks. Prior datasets linking to MITRE ATT&CK either provide a single label per document or annotate sentences out of context; our dataset annotates entire documents in a much finer-grained way. In an experimental study, we model the annotations of our dataset using state-of-the-art neural models. In our few-shot scenario, we find that for identifying the MITRE ATT&CK concepts that are mentioned explicitly or implicitly in a text, concept descriptions from MITRE ATT&CK are an effective source for training data augmentation.
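
The augmentation idea in the last sentence of the abstract can be illustrated with a minimal sketch. It assumes that each concept description is simply added as an extra labeled training example next to the annotated sentences; the abstract does not spell out the exact procedure, so this is only one plausible reading. The `Example` class, the `augment_with_descriptions` helper, and the toy texts are hypothetical and are not taken from AnnoCTR or from the MITRE ATT&CK knowledge base; only the two technique IDs are real ATT&CK identifiers.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Example:
    """One training instance: a text and the concept ID it is labeled with."""
    text: str
    label: str  # a MITRE ATT&CK ID such as "T1566"


def augment_with_descriptions(annotated, descriptions):
    """Return the annotated examples plus one example per concept description.

    annotated:    list of Example built from annotated report sentences.
    descriptions: dict mapping a concept ID to its description text.
    Both input formats are assumptions made for this sketch.
    """
    extra = [
        Example(text=description, label=concept_id)
        for concept_id, description in sorted(descriptions.items())
    ]
    return list(annotated) + extra


# Toy inputs: placeholder strings, not real dataset or knowledge-base text.
annotated = [
    Example("The actor sent emails with a malicious attachment.", "T1566"),
]
descriptions = {
    "T1566": "Placeholder description of the phishing technique.",
    "T1059": "Placeholder description of the command and scripting interpreter technique.",
}

train = augment_with_descriptions(annotated, descriptions)
labels = sorted({example.label for example in train})
```

In this toy setup, `T1059` has no annotated sentence at all, so its description is the only training signal for that label. That is the kind of gap a few-shot scenario leaves open and that description-based augmentation is meant to fill.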

Paper Structure

This paper contains 28 sections, 4 figures, and 11 tables.

Figures (4)

  • Figure 1: AnnoCTR is a CC-BY-SA-licensed dataset of 120 cyber threat reports annotated with MITRE ATT&CK concepts and WikiData entities.
  • Figure 2: MITRE ATT&CK (sub)techniques.
  • Figure 3: Distribution of technique links annotated in AnnoCTR, showing techniques that occur at least 5 times. In addition, there is a long tail of 136 instances annotated with techniques occurring less frequently.
  • Figure 4: Distribution of tactic links annotated in AnnoCTR.