AttackER: Towards Enhancing Cyber-Attack Attribution with a Named Entity Recognition Dataset
Pritam Deka, Sampath Rajapaksha, Ruby Rani, Amirah Almutairi, Erisa Karafili
TL;DR
AttackER introduces the first NER-centric dataset for cyber-attack attribution, covering 18 STIX 2.1-aligned entity types across 2640 annotated sentences from 217 sources. It compares spaCy, Hugging Face transformers, and instruction-tuned LLMs (GPT-3.5, Llama-2, Mistral-7B) on the NER task, highlighting strong gains from annotation quality, fine-tuning, and prompt design. The work releases the dataset and models, demonstrates the feasibility of automated attribution assistance, and discusses ground-truth alignment as a key factor in evaluating LLM performance. Future work aims to scale the dataset, extract relationships, and build end-to-end tools for attribution support and cyber-investigation QA.
Abstract
Cyber-attack attribution is an important process that allows experts to put in place attacker-oriented countermeasures and legal actions. Analysts mainly perform attribution manually, given the complexity of the task. AI and, more specifically, Natural Language Processing (NLP) techniques can be leveraged to support cybersecurity analysts during the attribution process. However powerful, these techniques are held back by the lack of datasets in the attack attribution domain. In this work, we fill this gap and provide, to the best of our knowledge, the first dataset on cyber-attack attribution. We designed our dataset with the primary goal of extracting attack attribution information from cybersecurity texts using named entity recognition (NER) methods from the field of NLP. Unlike other cybersecurity NER datasets, ours offers a rich set of annotations with contextual details, including some that span phrases and sentences. We conducted extensive experiments and applied NLP techniques to demonstrate the dataset's effectiveness for attack attribution. These experiments highlight the potential of Large Language Models (LLMs) to improve NER on cybersecurity datasets for cyber-attack attribution.
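
To make the idea of phrase-level annotations concrete, the following is a minimal sketch of how character-span annotations might be converted into token-level BIO tags for NER training. It is not the AttackER format or pipeline: the sentence, the offsets, the helper `spans_to_bio`, and the whitespace tokenization are all assumptions made for illustration, and the label names are borrowed from STIX 2.1 object types rather than taken from the dataset's actual label set.

```python
from typing import List, Tuple

# (start_char, end_char, label); end_char is exclusive.
Span = Tuple[int, int, str]


def spans_to_bio(text: str, spans: List[Span]) -> List[Tuple[str, str]]:
    """Whitespace-tokenize `text` and tag each token with a BIO label.

    Hypothetical helper: a token is tagged only if it lies fully inside
    an annotated span; all other tokens receive "O".
    """
    tagged = []
    pos = 0
    for token in text.split():
        # Locate the token's character offsets in the original text.
        start = text.index(token, pos)
        end = start + len(token)
        pos = end
        tag = "O"
        for span_start, span_end, label in spans:
            if start >= span_start and end <= span_end:
                # The first token of a span gets B-, later tokens get I-.
                prefix = "B-" if start == span_start else "I-"
                tag = prefix + label
                break
        tagged.append((token, tag))
    return tagged


# Illustrative sentence and annotations (invented, not from the dataset).
sentence = "ExampleGroup used spear phishing emails against banks"
spans = [(0, 12, "THREAT_ACTOR"), (18, 39, "ATTACK_PATTERN")]

for token, tag in spans_to_bio(sentence, spans):
    print(f"{token}\t{tag}")
```

Here the three-token phrase "spear phishing emails" is treated as a single entity, which is the kind of multi-word span that single-token labeling schemes cannot express directly.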
