Can LLMs Classify CVEs? Investigating LLMs Capabilities in Computing CVSS Vectors
Francesco Marchiori, Denis Donadel, Mauro Conti
TL;DR
This work assesses whether general-purpose LLMs can generate CVSS vectors from CVE descriptions and compares them to embedding-based methods. It evaluates four approaches on a dataset of CVEs scored under CVSS v3.1: vanilla LLM prompting, CWE augmentation, embedding-based classification, and a hybrid LLM-embedding approach. LLMs perform well on objective components but struggle with subjective ones, where embeddings do better; the hybrid approach achieves the best overall accuracy (~0.835). The code is released as open source, and the results indicate that, among the approaches tested, combining linguistic reasoning with learned representations gives the most reliable CVSS scoring for high-volume vulnerability pipelines.
Abstract
Common Vulnerabilities and Exposures (CVE) records are fundamental to cybersecurity, offering unique identifiers for publicly known software and system vulnerabilities. Each CVE is typically assigned a Common Vulnerability Scoring System (CVSS) score to support risk prioritization and remediation. However, score inconsistencies often arise due to subjective interpretations of certain metrics. As the number of new CVEs continues to grow rapidly, automation is increasingly necessary to ensure timely and consistent scoring. While prior studies have explored automated methods, the application of Large Language Models (LLMs), despite their recent popularity, remains relatively underexplored. In this work, we evaluate the effectiveness of LLMs in generating CVSS scores for newly reported vulnerabilities. We investigate various prompt engineering strategies to enhance their accuracy and compare LLM-generated scores against those from embedding-based models, which use vector representations classified via supervised learning. Our results show that while LLMs demonstrate potential in automating CVSS evaluation, embedding-based methods outperform them in scoring the more subjective components, particularly the confidentiality, integrity, and availability impacts. These findings underscore the complexity of CVSS scoring and suggest that combining LLMs with embedding-based methods could yield more reliable results across all scoring components.
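To make "scoring components" concrete, the sketch below splits a CVSS v3.1 base vector into its eight metrics and computes per-metric accuracy of predicted vectors against reference vectors. It is a minimal illustration, not the authors' evaluation code: the function names `parse_vector` and `per_metric_accuracy` and the sample vectors are assumptions made for this example.

```python
# Illustrative sketch only; names and sample data are hypothetical.
# The eight CVSS v3.1 base metrics, in their standard vector order.
METRICS = ("AV", "AC", "PR", "UI", "S", "C", "I", "A")


def parse_vector(vector):
    """Split 'CVSS:3.1/AV:N/AC:L/...' into a {metric: value} dict."""
    prefix, *parts = vector.split("/")
    if prefix != "CVSS:3.1":
        raise ValueError(f"not a CVSS v3.1 vector: {vector!r}")
    values = dict(part.split(":", 1) for part in parts)
    missing = [m for m in METRICS if m not in values]
    if missing:
        raise ValueError(f"missing base metrics {missing} in {vector!r}")
    return {m: values[m] for m in METRICS}


def per_metric_accuracy(predicted, reference):
    """Fraction of exact matches for each base metric across paired vectors."""
    if len(predicted) != len(reference) or not reference:
        raise ValueError("need two non-empty lists of equal length")
    hits = {m: 0 for m in METRICS}
    for pred, ref in zip(predicted, reference):
        pred_values, ref_values = parse_vector(pred), parse_vector(ref)
        for m in METRICS:
            hits[m] += pred_values[m] == ref_values[m]
    return {m: hits[m] / len(reference) for m in METRICS}


if __name__ == "__main__":
    reference = [
        "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:H/I:H/A:H",
        "CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:U/C:L/I:N/A:N",
    ]
    # The first prediction disagrees only on the confidentiality impact (C).
    predicted = [
        "CVSS:3.1/AV:N/AC:L/PR:N/UI:N/S:U/C:L/I:H/A:H",
        "CVSS:3.1/AV:L/AC:L/PR:L/UI:N/S:U/C:L/I:N/A:N",
    ]
    for metric, accuracy in per_metric_accuracy(predicted, reference).items():
        print(f"{metric}: {accuracy:.2f}")
```

Reporting accuracy per metric rather than per whole vector is what allows a comparison such as "better on attack vector, worse on confidentiality impact" to be stated at all; a single exact-match score over the full vector would hide that difference.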
