A Comparison of Vulnerability Feature Extraction Methods from Textual Attack Patterns

Refat Othman, Bruno Rossi, Barbara Russo

TL;DR

The paper addresses the extraction of vulnerability information from unstructured attack-pattern reports so that it can be linked to CVEs. It compares five feature extraction methods (TF-IDF, LSI, BERT, MiniLM, and RoBERTa) across multiple classifiers using VULDAP, a new dataset that maps MITRE CAPEC and CWE entries to CVEs. TF-IDF yields the best overall performance, with a precision of 75% and an F1 score of 64%, which can guide method selection for text-to-CVE extraction in threat intelligence pipelines. The work contributes a dataset, an evaluation framework, and practical guidance for automated vulnerability extraction in cyber threat intelligence.

Abstract

Nowadays, threat reports from cybersecurity vendors incorporate detailed descriptions of attacks within unstructured text. Knowing which vulnerabilities are related to these reports helps cybersecurity researchers and practitioners understand and adapt to evolving attacks and develop mitigation plans. This paper aims to help cybersecurity researchers and practitioners choose attack extraction methods to enhance the monitoring and sharing of threat intelligence. In this work, we examine five feature extraction methods (TF-IDF, LSI, BERT, MiniLM, RoBERTa) and find that Term Frequency-Inverse Document Frequency (TF-IDF) outperforms the other four methods with a precision of 75% and an F1 score of 64%. The findings offer valuable insights to the cybersecurity community, and our research can aid cybersecurity researchers in evaluating and comparing the effectiveness of upcoming extraction methods.
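
As a rough illustration of what TF-IDF feature extraction does to a short text, the sketch below weights each term by how often it occurs in a document and how rare it is across the collection. It is not the authors' implementation: the abstract does not state which TF-IDF variant or tokenizer was used, so the smoothed IDF formula, the whitespace tokenizer, the function name `tfidf`, and the three sample sentences are all assumptions made for illustration.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document.

    Assumed variant: raw term count times a smoothed IDF,
    idf(t) = ln((1 + N) / (1 + df(t))) + 1.
    """
    # Naive lowercase whitespace tokenization (an assumption, not the paper's pre-processing).
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    # Document frequency: the number of documents that contain each term.
    df = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        tf = Counter(tokens)
        vectors.append({
            term: count * (math.log((1 + n_docs) / (1 + df[term])) + 1.0)
            for term, count in tf.items()
        })
    return vectors


# Hypothetical attack-pattern snippets, invented for illustration only.
docs = [
    "attacker injects sql through the login form",
    "attacker sends oversized input causing buffer overflow",
    "attacker injects script into the web page",
]
vectors = tfidf(docs)
```

In this toy collection, a term that appears in every snippet ("attacker") receives a lower weight than a term unique to one snippet ("sql"), which is the property that lets a downstream classifier focus on distinguishing vocabulary.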

Paper Structure (14 sections, 5 figures, 2 tables)

Introduction
Study Design and Methodology
Dataset Collection
Text Pre-Processing
Feature Extraction
Oversampling
Classification
Classification Settings and Cross Validation
Preliminary Results
RQ: How do different feature extraction methods compare in terms of performance when classifying textual descriptions of attack patterns to CVE issues across different classifiers?
Threats to Validity
Related Work
Conclusion
Acknowledgements

Figures (5)

Figure 1: Methodology Overview
Figure 2: MITRE repositories and their connections
Figure 3: Text pre-processing
Figure 4: Boxplot of F1 score for five methods
Figure 5: Boxplot of five methods on performance metrics
