How Far Have We Gone in Vulnerability Detection Using Large Language Models

Zeyu Gao, Hao Wang, Yuchen Zhou, Wenyu Zhu, Chao Zhang

TL;DR

VulBench provides a high-quality, cross-source benchmark for evaluating vulnerability detection with large language models, combining CTF, MAGMA, Devign, D2A, and Big-Vul alongside natural-language root-cause descriptions. The study evaluates 16 LLMs and 6 baselines on binary and multi-class vulnerability tasks, revealing GPT-4’s strong performance on simplified data but limited real-world transfer, and highlighting benefits of context, decompiled-code readability, and few-shot prompting. Real-world datasets expose persistent gaps, suggesting that LLMs should be integrated with traditional vulnerability-detection tools (e.g., fuzzing, static analyzers) to achieve robust results. Overall, the work demonstrates untapped potential for LLMs in software security while outlining concrete directions for dataset quality, evaluation, and hybrid approaches to handle complex projects at scale.

Abstract

As software becomes increasingly complex and prone to vulnerabilities, automated vulnerability detection is critically important, yet challenging. Given the significant successes of large language models (LLMs) in various tasks, there is growing anticipation of their efficacy in vulnerability detection. However, a quantitative understanding of their potential in vulnerability detection is still missing. To bridge this gap, we introduce a comprehensive vulnerability benchmark VulBench. This benchmark aggregates high-quality data from a wide range of CTF (Capture-the-Flag) challenges and real-world applications, with annotations for each vulnerable function detailing the vulnerability type and its root cause. Through our experiments encompassing 16 LLMs and 6 state-of-the-art (SOTA) deep learning-based models and static analyzers, we find that several LLMs outperform traditional deep learning approaches in vulnerability detection, revealing an untapped potential in LLMs. This work contributes to the understanding and utilization of LLMs for enhanced software security.

Paper Structure

This paper contains 47 sections, 11 figures, and 13 tables.

Figures (11)

  • Figure 1: Few-shot conversation for binary classification (whether a function is vulnerable) and multi-class classification (which vulnerability the function has) on the CTF dataset. BC stands for binary classification; MC stands for multi-class classification.
  • Figure 2: Few-shot conversation for binary classification (whether a function is vulnerable) and multi-class classification (which vulnerability the function has) on the real-world dataset.
  • Figure 3: Few-shot conversation for binary classification (whether a function is vulnerable). Texts in the green box are the queries, and texts in the yellow box are the model's answers.
  • Figure 4: Few-shot conversation for multi-class classification (determine a function's vulnerability type). Texts in the green box are the queries, and texts in the yellow box are the model's answers.
  • Figure 5: Confusion matrix of different models on the CTF dataset
  • ...and 6 more figures