How Far Have We Gone in Vulnerability Detection Using Large Language Models
Zeyu Gao, Hao Wang, Yuchen Zhou, Wenyu Zhu, Chao Zhang
TL;DR
VulBench provides a high-quality, cross-source benchmark for evaluating vulnerability detection with large language models. It combines data from CTF challenges, MAGMA, Devign, D2A, and Big-Vul with natural-language descriptions of each vulnerability's root cause. The study evaluates 16 LLMs and 6 baselines on binary and multi-class vulnerability tasks. It finds that GPT-4 performs strongly on simplified data but transfers only to a limited degree to real-world code, and that additional context, more readable decompiled code, and few-shot prompting all help. Real-world datasets expose persistent gaps, suggesting that LLMs should be integrated with traditional vulnerability-detection tools (e.g., fuzzing, static analyzers) to achieve robust results. Overall, the work demonstrates untapped potential for LLMs in software security while outlining concrete directions for dataset quality, evaluation, and hybrid approaches to handle complex projects at scale.
Abstract
As software becomes increasingly complex and prone to vulnerabilities, automated vulnerability detection is critically important, yet challenging. Given the significant successes of large language models (LLMs) in various tasks, there is growing anticipation of their efficacy in vulnerability detection. However, a quantitative understanding of their potential in vulnerability detection is still missing. To bridge this gap, we introduce VulBench, a comprehensive vulnerability benchmark. It aggregates high-quality data from a wide range of CTF (Capture-the-Flag) challenges and real-world applications, with annotations for each vulnerable function detailing the vulnerability type and its root cause. In experiments covering 16 LLMs and 6 state-of-the-art (SOTA) deep learning-based models and static analyzers, we find that several LLMs outperform traditional deep learning approaches in vulnerability detection, revealing an untapped potential in LLMs. This work contributes to the understanding and utilization of LLMs for enhanced software security.
