Automatic Simplification of Common Vulnerabilities and Exposures Descriptions
Varpu Vehomäki, Kimmo K. Kaski
TL;DR
This study investigates automatic text simplification (ATS) of CVE descriptions using large language models, aiming to make cybersecurity reports more accessible to non-experts. It creates a semi-synthetic 40-CVE test set, develops an agent-based GemmaAgent pipeline with NER term extraction and a retrieval-augmented generation component, and evaluates multiple models with both automatic metrics and two rounds of human expert surveys. Results indicate that while out-of-the-box LLMs can simplify text, meaning preservation remains problematic; retrieval-augmented and domain-lexicon–driven approaches show some gains in meaning, whereas simplicity gains alone can come at the cost of accuracy. The work highlights the need for larger, domain-specific data, careful prompting, and robust human-in-the-loop evaluation to produce reliable, accessible CVE descriptions for cybersecurity decision-makers.
Abstract
Understanding cyber security is increasingly important for individuals and organizations. However, much of the information related to cyber security can be difficult to understand for those not familiar with the topic. In this study, we investigate how large language models (LLMs) could be utilized in automatic text simplification (ATS) of Common Vulnerabilities and Exposures (CVE) descriptions. Automatic text simplification has been studied in several contexts, such as medical, scientific, and news texts, but it has not yet been applied to texts in the rapidly changing and complex domain of cyber security. We created a baseline for cyber security ATS and a test dataset of 40 CVE descriptions, evaluated by two groups of cyber security experts in two survey rounds. We found that while out-of-the-box LLMs can make the text appear simpler, they struggle with meaning preservation. Code and data are available at https://version.aalto.fi/gitlab/vehomav1/simplification_nmi.
