VulDetectBench: Evaluating the Deep Capability of Vulnerability Detection with Large Language Models

Yu Liu, Lang Gao, Mingxin Yang, Yu Xie, Ping Chen, Xiaojin Zhang, Wei Chen

TL;DR

VulDetectBench introduces a five-task benchmark that evaluates LLMs on vulnerability detection in large C/C++ codebases, spanning vulnerability existence detection, CWE type inference, and fine-grained localization (root cause and trigger points). The study evaluates 17 LLMs and finds strong performance on the simpler existence and type inference tasks but pronounced gaps in detailed vulnerability analysis and precise code localization. The authors show that model size and prompting strategies influence results, highlight the limitations of current LLMs in deep vulnerability understanding, and compare the models against traditional static analysis tools. The benchmark provides a foundation for future improvements in automated vulnerability analysis and can guide responsible deployment of LLMs in security workflows.

Abstract

Large Language Models (LLMs) have training corpora containing large amounts of program code, greatly improving the models' code comprehension and generation capabilities. However, comprehensive research on detecting program vulnerabilities, a more specialized code-related task, and on evaluating the performance of LLMs in this scenario is still lacking. To address common challenges in vulnerability analysis, our study introduces a new benchmark, VulDetectBench, specifically designed to assess the vulnerability detection capabilities of LLMs. The benchmark comprehensively evaluates LLMs' ability to identify, classify, and locate vulnerabilities through five tasks of increasing difficulty. We evaluate the performance of 17 models (both open- and closed-source) and find that while existing models can achieve over 80% accuracy on tasks related to vulnerability identification and classification, they still fall short on more detailed vulnerability analysis tasks, with less than 30% accuracy, making it difficult for them to provide valuable auxiliary information for professional vulnerability mining. Our benchmark effectively evaluates the capabilities of various LLMs at different levels in the specific task of vulnerability detection, providing a foundation for future research and improvements in this critical area of code security. VulDetectBench is publicly available at https://github.com/Sweetaroo/VulDetectBench.

Paper Structure (30 sections, 2 equations, 11 figures, 13 tables)

This paper contains 30 sections, 2 equations, 11 figures, 13 tables.

Figures (11)

  • Figure 1: Top 8 LLMs' ability on vulnerability detection. Our benchmark consists of five vulnerability-analysis tasks of increasing difficulty. The figure shows that existing LLMs perform well on simple analysis tasks such as vulnerability existence detection and CWE type inference, while on more specific vulnerability-related tasks performance varies from LLM to LLM and is, overall, not yet satisfactory.
  • Figure 2: Overview of VulDetectBench Construction. We collect publicly available vulnerability databases to form the initial dataset and design five tasks with increasing difficulty to build the benchmark by analyzing the vulnerabilities. We ensure the quality of the benchmark through data processing, including the compatibility of the context size limit, the deletion of obvious hints, and the distribution of the vulnerability types, and at the same time construct the ground truth answer. Next, we construct the prompts corresponding to the five tasks, input them to 17 LLMs for evaluation, and finally derive the analysis results.
  • Figure 3: Performance comparison of different sizes within the same LLM family
  • Figure 4: Performance comparison. The first subfigure shows the performance on 100 samples of different tasks. The second subfigure shows the performance of the model given that the previous task was performed correctly. (The quantity in Task 2 is the number that Task 1 gets right.)
  • Figure 5: CWE Distribution of VulDetectBench among 5 Tasks
  • ...and 6 more figures
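
The conditional evaluation described in the Figure 4 caption (performance on a task, counted only over samples where the previous task was answered correctly) can be sketched as follows. This is an illustrative sketch, not the authors' evaluation code: the per-sample result format, the function name `conditional_accuracy`, and the use of a single correct/incorrect flag per task are all assumptions, and the sample outcomes are made up for the example.

```python
from typing import Dict, List, Optional


def conditional_accuracy(results: List[Dict[str, bool]],
                         task: str,
                         prev_task: str) -> Optional[float]:
    """Accuracy on `task`, restricted to samples where `prev_task` was correct.

    Hypothetical helper: each entry of `results` maps a task name to
    True (correct) or False (incorrect) for one sample.
    """
    # Keep only the samples the model got right on the previous task.
    eligible = [r for r in results if r[prev_task]]
    if not eligible:
        return None  # undefined when no sample passed the previous task
    # The denominator is the number of samples the previous task got right.
    return sum(r[task] for r in eligible) / len(eligible)


# Hypothetical per-sample outcomes; not data from the paper.
results = [
    {"task1": True, "task2": True},
    {"task1": True, "task2": False},
    {"task1": False, "task2": True},
    {"task1": True, "task2": True},
]

print(conditional_accuracy(results, "task2", "task1"))
```

In this toy input, three samples pass Task 1 and two of those also pass Task 2, so the denominator for Task 2 is three rather than four, matching the caption's note that the quantity in Task 2 is the number that Task 1 gets right.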