Semantic Ranking for Automated Adversarial Technique Annotation in Security Text

Udesh Kumarasinghe, Ahmed Lekssays, Husrev Taha Sencar, Sabri Boughorbel, Charitha Elvitigala, Preslav Nakov

TL;DR

This work tackles the challenge of mapping threat intelligence text to MITRE ATT&CK techniques by reframing it as a learning-to-rank problem. It presents a three-stage ranking pipeline that progressively refines candidates via BM25, a domain-tuned bi-encoder (SentSecBERT), and a mono-encoder (monoT5), leveraging a newly created public dataset of 6.6k threat-behavior descriptions linked to ATT&CK techniques. The approach achieves state-of-the-art results, with a top-3 recall of $81\%$ among $193$ techniques, and significantly outperforms zero-shot large language models by about $40\%$. The work also provides extensive ablations and cross-dataset analyses, showing the importance of domain-specific fine-tuning and reliable preprocessing, and it makes a public dataset available to accelerate future research in automated threat-behavior annotation.
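
The cascade described above can be illustrated with a minimal sketch, assuming each stage is a scoring function paired with a cutoff, so that cheaper rankers prune the candidate list before costlier ones run. Only the first stage (a plain Okapi BM25) corresponds to a component named in the summary; the token-overlap scorer is a hypothetical placeholder standing in for the bi-encoder and monoT5 stages, and the function names, cutoffs, and one-line technique descriptions are illustrative assumptions rather than the authors' implementation or official ATT&CK text.

```python
import math
from collections import Counter
from typing import Callable, Dict, List, Sequence, Tuple

Scorer = Callable[[str, str], float]  # (query, technique id) -> relevance score


def tokenize(text: str) -> List[str]:
    return text.lower().split()


def make_bm25_scorer(docs: Dict[str, str], k1: float = 1.5, b: float = 0.75) -> Scorer:
    """Stage 1: plain Okapi BM25 over the technique descriptions."""
    tokens = {doc_id: tokenize(text) for doc_id, text in docs.items()}
    n_docs = len(tokens)
    avg_len = sum(len(t) for t in tokens.values()) / n_docs
    doc_freq = Counter(term for t in tokens.values() for term in set(t))

    def score(query: str, doc_id: str) -> float:
        tf = Counter(tokens[doc_id])
        doc_len = len(tokens[doc_id])
        total = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * doc_len / avg_len)
            total += idf * tf[term] * (k1 + 1) / norm
        return total

    return score


def make_overlap_scorer(docs: Dict[str, str]) -> Scorer:
    """Placeholder for the neural stages: Jaccard overlap of token sets."""
    def score(query: str, doc_id: str) -> float:
        q, d = set(tokenize(query)), set(tokenize(docs[doc_id]))
        return len(q & d) / len(q | d) if q | d else 0.0

    return score


def cascade_rank(query: str, candidates: Sequence[str],
                 stages: Sequence[Tuple[Scorer, int]]) -> List[str]:
    """Re-rank the survivors at each stage and keep only the top K_i."""
    ranked = list(candidates)
    for scorer, cutoff in stages:
        ranked = sorted(ranked, key=lambda d: scorer(query, d), reverse=True)[:cutoff]
    return ranked


# Toy corpus: real ATT&CK technique IDs with made-up one-line descriptions.
techniques = {
    "T1566": "phishing email with malicious attachment or link sent to victims",
    "T1059": "adversaries abuse command and script interpreters to execute commands",
    "T1003": "adversaries dump credentials from operating system memory",
    "T1486": "adversaries encrypt data on target systems to interrupt availability",
}
stages = [
    (make_bm25_scorer(techniques), 3),     # K_1: lexical pruning
    (make_overlap_scorer(techniques), 2),  # K_2: stand-in for the bi-encoder
    (make_overlap_scorer(techniques), 1),  # K_3: stand-in for monoT5
]
top = cascade_rank("attacker sent a phishing email with a malicious attachment",
                   list(techniques), stages)
print(top)
```

In this arrangement the cutoffs shrink from stage to stage, so the most expensive scorer only sees the few candidates that survived the cheaper ones.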

Abstract

We introduce a new method for extracting structured threat behaviors from threat intelligence text. Our method is based on a multi-stage ranking architecture that allows jointly optimizing for efficiency and effectiveness. We believe this problem formulation better aligns with the real-world nature of the task, given the large number of adversary techniques and the extensive body of threat intelligence created by security analysts. Our findings show that the proposed system yields state-of-the-art performance on this task: it achieves a top-3 recall of 81% in identifying the relevant technique among 193 top-level techniques. Our tests also demonstrate that the system performs significantly better (+40%) than widely used large language models when tested in a zero-shot setting.
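
As a concrete reading of the top-3 recall figure, the sketch below computes recall at k under the assumption that each behavior description has exactly one gold technique, so a query counts as a hit when that technique appears among the first k ranked candidates. The function name and the toy rankings are hypothetical and are not taken from the paper's evaluation code.

```python
from typing import List, Sequence


def recall_at_k(ranked_lists: Sequence[Sequence[str]],
                gold_labels: Sequence[str], k: int = 3) -> float:
    """Fraction of queries whose gold technique is within the top-k ranking."""
    if len(ranked_lists) != len(gold_labels):
        raise ValueError("need one gold label per ranked list")
    if not gold_labels:
        raise ValueError("no queries to evaluate")
    hits = sum(1 for ranked, gold in zip(ranked_lists, gold_labels)
               if gold in ranked[:k])
    return hits / len(gold_labels)


# Toy example: three queries, each with a ranked list of technique IDs.
rankings: List[List[str]] = [
    ["T1566", "T1059", "T1003", "T1486"],  # gold T1566 at rank 1 -> hit
    ["T1059", "T1486", "T1003", "T1566"],  # gold T1003 at rank 3 -> hit
    ["T1059", "T1003", "T1566", "T1486"],  # gold T1486 at rank 4 -> miss
]
gold = ["T1566", "T1003", "T1486"]
score = recall_at_k(rankings, gold, k=3)
```

With two of the three gold techniques inside the top three, the toy score is 2/3.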

Paper Structure (36 sections, 3 equations, 7 figures, 13 tables)

Figures (7)

  • Figure 1: (a) The practical application of our proposed system in annotating threat reports, (b) Examples of report sentences with corresponding predicted techniques.
  • Figure 2: System Architecture of the Proposed Method ($K_i$ represents the number of candidate technique IDs at the output of the $i$-th stage, with $K_1 > K_2 > K_3$).
  • Figure 3: Effect of raising the Stage-1 ranker's cutoff threshold from 50 to 150 on the Stage-2 ranker's performance. The blue curve shows the decrease in recall, while the yellow curve shows the increase in computation time at the Stage-2 ranker for processing each query behavior.
  • Figure 4: Impact of increasing the number of few-shot learning examples on the measured Recall@3 score.
  • Figure 5: (a) Recall at top 3 vs the number of training samples available for each technique, (b) Recall at top 3 vs MITRE ATT&CK document size (in characters) for each technique.
  • ...and 2 more figures