Streamlining Security Vulnerability Triage with Large Language Models
Mohammad Jalili Torkamani, Joey NG, Nikita Mehrotra, Mahinthan Chandramohan, Padmanabhan Krishnan, Rahul Purandare
TL;DR
CASEY addresses the bottleneck in security vulnerability triage by using Large Language Models to automate CWE identification and CVSS-based severity inference from bug reports and code, combining prompt engineering with contextual information at varying levels of granularity. The authors fine-tune GPT-3.5 on an enhanced NVD corpus and compare against a GPT-3 baseline, reporting a CWE accuracy of 68%, a severity accuracy of 73.6%, and a combined accuracy of 51.2%, along with enterprise-ready deployment considerations. The approach relies on a two-dataset setup (evaluation and CVE2CWE) and a multi-stage pipeline that validates data and formats outputs for rigorous evaluation. Empirical findings highlight the importance of bug descriptions and targeted code fragments, and show that including full buggy files is a trade-off because of the noise they add. The work points to practical benefits for vulnerability management, offering a path to internal deployment that preserves data confidentiality and can adapt to evolving CVSS definitions.
Abstract
Bug triaging for security vulnerabilities is a critical part of software maintenance, ensuring that the most pressing vulnerabilities are addressed promptly to safeguard system integrity and user data. However, the process is resource-intensive and comes with challenges, including classifying software vulnerabilities, assessing their severity, and managing a high volume of bug reports. In this paper, we present CASEY, a novel approach that leverages Large Language Models (in our case, the GPT model) to automate the identification of Common Weakness Enumerations (CWEs) of security bugs and to assess their severity. CASEY employs prompt engineering techniques and incorporates contextual information at varying levels of granularity to assist in the bug triaging process. We evaluated CASEY using an augmented version of the National Vulnerability Database (NVD), employing quantitative and qualitative metrics to measure its performance across CWE identification, severity assessment, and their combined analysis. CASEY achieved a CWE identification accuracy of 68%, a severity identification accuracy of 73.6%, and a combined accuracy of 51.2% for identifying both. These results demonstrate the potential of LLMs in identifying CWEs and severity levels, streamlining software vulnerability management, and improving the efficiency of security vulnerability triaging workflows.
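
The three reported figures can be read as exact-match accuracies over the same set of bug reports. The sketch below is one plausible way to compute them, assuming a prediction counts toward the combined accuracy only when both the CWE and the severity label match; the `Triage` record and the `triage_accuracy` helper are hypothetical names introduced for illustration and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triage:
    """Hypothetical record holding one triage label pair."""
    cwe: str       # e.g. "CWE-79"
    severity: str  # e.g. "HIGH"


def triage_accuracy(predicted, expected):
    """Return CWE, severity, and combined exact-match accuracy.

    Assumption: 'combined' counts a report as correct only when
    both the CWE and the severity match the reference label.
    """
    if not expected or len(predicted) != len(expected):
        raise ValueError("predicted and expected must be non-empty and equal in length")
    n = len(expected)
    pairs = list(zip(predicted, expected))
    cwe_hits = sum(p.cwe == e.cwe for p, e in pairs)
    severity_hits = sum(p.severity == e.severity for p, e in pairs)
    both_hits = sum(p == e for p, e in pairs)  # dataclass equality compares both fields
    return {
        "cwe": cwe_hits / n,
        "severity": severity_hits / n,
        "combined": both_hits / n,
    }


# Toy data, invented purely to exercise the function.
expected = [
    Triage("CWE-79", "HIGH"),
    Triage("CWE-89", "CRITICAL"),
    Triage("CWE-22", "MEDIUM"),
    Triage("CWE-119", "HIGH"),
]
predicted = [
    Triage("CWE-79", "HIGH"),
    Triage("CWE-89", "HIGH"),     # CWE right, severity wrong
    Triage("CWE-20", "MEDIUM"),   # CWE wrong, severity right
    Triage("CWE-119", "HIGH"),
]
scores = triage_accuracy(predicted, expected)
```

Under this definition the combined accuracy can never exceed the smaller of the two individual accuracies, which is consistent with the reported 51.2% sitting below both 68% and 73.6%.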
