CPE-Identifier: Automated CPE identification and CVE summaries annotation with Deep Learning and NLP

Wanyu Hu; Vrizlynn L. L. Thing

TL;DR

The paper tackles the growing workload of mapping CPEs to CVE summaries in the NVD by introducing CPE-Identifier, an automated CPE annotator and extractor that uses state-of-the-art NLP models for named entity recognition (NER). It combines a data-annotation phase (fine-tuned BERT) with data augmentation (DistilRoBERTa) and evaluates three NER architectures (BERT, XLNet, GPT-2) on a large cybersecurity-specific corpus, achieving a best F1 score of 95.48% and an accuracy of 99.13%. A Streamlit-based GUI enables interactive annotation of CVE text with color-coded CPE entities, and the pipeline outputs labeled data to accelerate vulnerability management. The work reports improvements of more than 9% on all metrics over prior methods and provides a practical tool to speed up CVE-CPE labeling, with implications for faster vulnerability assessment and risk measurement in security operations.

Abstract

With the drastic increase in the number of new vulnerabilities in the National Vulnerability Database (NVD) every year, the work of NVD analysts in associating Common Platform Enumeration (CPE) entries with Common Vulnerabilities and Exposures (CVE) summaries is becoming increasingly laborious and slow. The delay leaves organisations that depend on the NVD for vulnerability management and security measurement more vulnerable to zero-day attacks. It is therefore essential to develop a technique and a tool that extract the CPEs from CVE summaries accurately and quickly. In this work, we propose CPE-Identifier, an automated system for annotating and extracting CPEs from CVE summaries. The system can be used as a tool to identify CPE entities in new CVE text inputs. We also automate the data generation and labeling processes using deep learning models. Because CVE texts are complex, new technical terminology appears frequently. To identify novel words in future CVE texts, we apply Natural Language Processing (NLP) Named Entity Recognition (NER) to detect new technical jargon in the text. Our proposed model achieves an F1 score of 95.48%, an accuracy of 99.13%, a precision of 94.83%, and a recall of 96.14%. We show that it outperforms prior work on automated CVE-CPE labeling by more than 9% on all metrics.
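As a quick consistency check on the reported metrics, the F1 score can be recomputed as the harmonic mean of the stated precision and recall. The sketch below assumes all three figures were computed at the same aggregation level (for example, micro-averaged over entities); the abstract does not say which averaging was used, and the function and variable names are illustrative rather than taken from the paper.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (both given in the same units)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Values reported in the abstract, in percent.
reported_precision = 94.83
reported_recall = 96.14
reported_f1 = 95.48

# Assumption: precision, recall, and F1 share one aggregation level.
computed_f1 = f1_score(reported_precision, reported_recall)
print(f"F1 from reported precision and recall: {computed_f1:.2f}%")
print(f"Matches reported F1 to two decimals: {round(computed_f1, 2) == reported_f1}")
```

Under that assumption, the recomputed value agrees with the reported 95.48% to two decimal places.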

Paper Structure (32 sections, 4 equations, 21 figures)

Introduction
Literature Review
Research Methodology
System Design
Proposed Approach
Dataset and Labels
CPE data
Padding/Trimming the sentences
Tagging Schemes
Models
BERT model
XLNet model
GPT-2 model
Implementation
Data Pre-processing Stage
...and 17 more sections

Figures (21)

Figure 1: CPE-Identifier Design
Figure 2: Models Training Process Design
Figure 3: Challenges and Proposed Solutions
Figure 4: Sentence Length Distribution
Figure 5: XLNet Permutation Language Modelling (PLM)
...and 16 more figures
