Benchmarking LLMs and LLM-based Agents in Practical Vulnerability Detection for Code Repositories

Alperen Yildiz, Sin G. Teo, Yiling Lou, Yebo Feng, Chong Wang, Dinil M. Divakaran

TL;DR

This work tackles the realism gap in vulnerability detection by introducing JitVul, a just-in-time (JIT) benchmark that links each target function to its vulnerability-introducing and vulnerability-fixing commits, enabling interprocedural and pairwise evaluation across 879 CVEs and 91 CWEs. It evaluates Plain LLMs, Dep-Aug LLMs, and ReAct Agents, combined with prompting strategies such as Chain-of-Thought and few-shot examples, across two foundation models. ReAct Agents make better use of interprocedural context for pairwise discrimination, though prompting and model limitations remain. The study emphasizes the importance of pairwise evaluation and interprocedural analysis for realistic vulnerability detection, and releases code and data to advance research in this area. The findings point to a need for agent-specific prompting designs, stronger reasoning capabilities, and scalable, robust interprocedural analysis for practical repository-level vulnerability detection.

Abstract

Large Language Models (LLMs) have shown promise in software vulnerability detection, particularly on function-level benchmarks like Devign and BigVul. However, real-world detection requires interprocedural analysis, as vulnerabilities often emerge through multi-hop function calls rather than isolated functions. While repository-level benchmarks like ReposVul and VulEval introduce interprocedural context, they remain computationally expensive, lack pairwise evaluation of vulnerability fixes, and explore only limited context retrieval, which restricts their practicality. We introduce JitVul, a just-in-time (JIT) vulnerability detection benchmark linking each function to its vulnerability-introducing and fixing commits. Built from 879 CVEs spanning 91 vulnerability types, JitVul enables comprehensive evaluation of detection capabilities. Our results show that ReAct Agents, leveraging thought-action-observation and interprocedural context, perform better than LLMs in distinguishing vulnerable from benign code. While prompting strategies like Chain-of-Thought help LLMs, ReAct Agents require further refinement. Both methods show inconsistencies, either misidentifying vulnerabilities or over-analyzing security guards, indicating significant room for improvement.
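To illustrate why pairwise evaluation matters, the following minimal sketch scores a detector on pairs of function versions. It is not the paper's implementation: the `PairResult` schema, the function names, the placeholder pair IDs, and the scoring rule itself (a pair counts as correct only when the vulnerable version is flagged and the fixed version is not) are assumptions made for illustration, since the abstract does not spell out the metric definitions.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairResult:
    """Detector verdicts for one function at two commits (hypothetical schema)."""
    pair_id: str
    flagged_vulnerable_version: bool  # verdict on the vulnerability-introducing version
    flagged_fixed_version: bool       # verdict on the fixed (benign) version


def pairwise_accuracy(results):
    """Fraction of pairs judged correctly on BOTH versions (assumed scoring rule)."""
    if not results:
        return 0.0
    correct = sum(
        1 for r in results
        if r.flagged_vulnerable_version and not r.flagged_fixed_version
    )
    return correct / len(results)


def per_version_accuracy(results):
    """Fraction of individual versions judged correctly, ignoring the pairing."""
    if not results:
        return 0.0
    correct = sum(
        int(r.flagged_vulnerable_version) + int(not r.flagged_fixed_version)
        for r in results
    )
    return correct / (2 * len(results))


# Made-up verdicts, not results from the paper.
results = [
    PairResult("pair-1", True, False),   # distinguishes the two versions
    PairResult("pair-2", True, True),    # flags both versions
    PairResult("pair-3", True, True),    # flags both versions
    PairResult("pair-4", False, False),  # flags neither version
]

print(pairwise_accuracy(results))     # 0.25
print(per_version_accuracy(results))  # 0.625
```

In this toy data, a detector that mostly answers "vulnerable" still gets 62.5% of individual versions right, but only 25% of pairs, which is the kind of gap a pairwise metric is meant to expose.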

Paper Structure

This paper contains 29 sections, 3 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: Construction process of JitVul.
  • Figure 2: Prompt templates used with Plain LLM.
  • Figure 3: Workflow of ReAct Agent for JIT vulnerability detection.
  • Figure 4: Prompt template used with ReAct Agent.
  • Figure 5: A few-shot (FS) example of "CWE-787: Out-of-bounds Write", including both the vulnerable version and the benign version.
  • ...and 3 more figures