Everything You Wanted to Know About LLM-based Vulnerability Detection But Were Afraid to Ask

Yue Li, Xiao Li, Hao Wu, Minghui Xu, Yue Zhang, Xiuzhen Cheng, Fengyuan Xu, Sheng Zhong

TL;DR

The paper asks whether LLM-based vulnerability detection benefits from real-world context, a question prior benchmarks sidestepped by evaluating code in isolation. It introduces CORRECT, a context-rich evaluation framework built on 2,000 vulnerable-patched program pairs across 99 CWEs, which uses dual prompts and LLM-as-a-judge to assess both detections and rationales. Across 13 LLMs from four model families, CORRECT shows that, given sufficient context, LLMs reach meaningful detection performance (e.g., 0.7 F1-score on key CWEs, with 0.8 precision) and can tell vulnerable code from its patched counterpart, while most remaining errors stem from reasoning mistakes and limited generalization. The work argues that context is essential for accurate vulnerability reasoning, highlights scaling trade-offs, and advocates re-evaluating standard benchmarks in order to build practical, context-aware vulnerability detection tools.

Abstract

Large Language Models are a promising tool for automated vulnerability detection, thanks to their success in code generation and repair. However, despite widespread adoption, a critical question remains: Are LLMs truly effective at detecting real-world vulnerabilities? Current evaluations, which often assess models on isolated functions or files, ignore the broader execution and data-flow context essential for understanding vulnerabilities. This oversight leads to two types of misleading outcomes: incorrect conclusions and flawed rationales, collectively undermining the reliability of prior assessments. Therefore, in this paper, we challenge three widely held community beliefs: that LLMs are (i) unreliable, (ii) insensitive to code patches, and (iii) performance-plateaued across model scales. We argue that these beliefs are artifacts of context-deprived evaluations. To address this, we propose CORRECT (Context-Rich Reasoning Evaluation of Code with Trust), a new evaluation framework that systematically incorporates contextual information into LLM-based vulnerability detection. We construct a context-rich dataset of 2,000 vulnerable-patched program pairs spanning 99 CWEs and evaluate 13 LLMs across four model families. Our framework elicits both binary predictions and natural-language rationales, which are further validated using LLM-as-a-judge techniques. Our findings overturn existing misconceptions. When provided with sufficient context, SOTA LLMs achieve significantly improved performance (e.g., 0.7 F1-score on key CWEs), with 0.8 precision. We show that most false positives stem from reasoning errors rather than misclassification, and that while model and test-time scaling improve performance, they introduce diminishing returns and trade-offs in recall. Finally, we uncover new flaws in current LLM-based detection systems, such as limited generalization and overthinking biases.
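The paired evaluation described in the abstract can be made concrete with a small sketch. The code below is not the authors' implementation: the names `PairResult`, `confusion`, `prf`, and `pairwise_outcomes` are hypothetical, and the `require_rationale` flag is only an assumption about how a judge verdict on the rationale might gate a detection, since this excerpt does not define the paper's Strict and Lenient modes. It shows how precision, recall, and F1 follow from binary predictions on vulnerable-patched pairs, and how pair-wise outcomes can be tallied.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class PairResult:
    """Predictions for one vulnerable/patched pair (True = flagged as vulnerable)."""
    vuln_pred: bool            # prediction on the vulnerable version
    patched_pred: bool         # prediction on the patched version
    rationale_ok: bool = True  # hypothetical judge verdict on the rationale


def confusion(pairs, require_rationale=False):
    """Return (tp, fp, tn, fn) over all pairs.

    Assumption: when require_rationale is True, a detection on the vulnerable
    version counts as a true positive only if the judge accepted its rationale.
    """
    tp = fp = tn = fn = 0
    for p in pairs:
        detected = p.vuln_pred and (p.rationale_ok or not require_rationale)
        if detected:
            tp += 1
        else:
            fn += 1
        if p.patched_pred:
            fp += 1  # patched code flagged as vulnerable
        else:
            tn += 1
    return tp, fp, tn, fn


def prf(tp, fp, fn):
    """Precision, recall, and F1, returning 0.0 where a ratio is undefined."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def pairwise_outcomes(pairs):
    """Share of pairs falling into each (vulnerable, patched) prediction pattern."""
    labels = {
        (True, False): "both correct",
        (True, True): "always vulnerable",
        (False, False): "always safe",
        (False, True): "both wrong",
    }
    counts = Counter(labels[(p.vuln_pred, p.patched_pred)] for p in pairs)
    return {name: counts[name] / len(pairs) for name in labels.values()}
```

A model that flags both versions of a pair lands in the "always vulnerable" bucket, which is one way to picture insensitivity to patches. Gating true positives on the rationale can only lower recall, so the two settings bracket how much of the measured performance rests on sound reasoning.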

Paper Structure

This paper contains 26 sections, 16 figures, 6 tables.

Figures (16)

  • Figure 1: Reasoning with LLMs.
  • Figure 2: (a) CVE-2017-7875, an out-of-bounds write vulnerability. (b) Essential callee context for (a). (c) A reasonable rationale but an incorrect conclusion. (d) A correct conclusion, but a wrong rationale (non-ground-truth vulnerability).
  • Figure 3: The three stages of CORRECT. The constructed functions and shared context are used in (b)-I; the rationales generated in (b)-I are evaluated for correctness by (b)-II, which employs LLM-as-a-judge; finally, the assessment is conducted in (c). For (c)-Lenient Mode, the figure shows only the key outcomes; further details are given in the paper's assessment subsection.
  • Figure 4: Performance comparison across various evaluation settings. (a)-(d): Different metrics in the "w/o context, w/o revision" configurations. (f), (h), (i), (j): Different metrics under Strict Mode. The model r1-qn-7b has been excluded from the analysis due to its excessive abnormal outputs, which substantially impair the clarity and reliability of the results.
  • Figure 5: Pair-wise prediction proportion of all LLMs.
  • ...and 11 more figures