To Err is Machine: Vulnerability Detection Challenges LLM Reasoning

Benjamin Steenhoek, Md Mahbubur Rahman, Monoshi Kumar Roy, Mirza Sanjida Alam, Hengbo Tong, Swarna Das, Earl T. Barr, Wei Le

TL;DR

This paper reframes vulnerability detection as a complex, multi-step code-reasoning task and benchmarks LLMs on the SVEN dataset. Through a large-scale evaluation of 14 SOTA LLMs across diverse prompts and a thorough error analysis, the authors show that scaling model size or training data yields little improvement: balanced accuracy stays around 50-55%, with only modest gains from domain-informed prompts. The study identifies key failure modes in localization, semantics, and logical reasoning about code, and argues that execution-aware pretraining or fundamentally new modeling approaches may be required. By providing a detailed error taxonomy and open-sourcing tools and data, the work lays groundwork for more reliable vulnerability reasoning and for broader software-engineering applications such as debugging and program repair.

Abstract

In this paper, we present a challenging code reasoning task: vulnerability detection. Large Language Models (LLMs) have shown promising results in natural-language and math reasoning, but state-of-the-art (SOTA) models reported only 54.5% Balanced Accuracy in our vulnerability detection evaluation, even those models pre-trained on large amounts of source code. Our error analysis on LLM responses shows that the models struggle to reason about the code semantics relevant to identifying vulnerabilities, especially subtle semantic differences caused by small textual changes. We explored prominent models and training settings to understand their effects on vulnerability detection performance -- including better prompts, larger models, more pre-training data, and fine-tuning -- but none led to significant improvements. This raises the question of whether simply scaling training data and model size will allow us to "solve" complex code reasoning tasks like vulnerability detection, or if a fundamental shift in modeling and training techniques is required. We also explored adding domain knowledge to prompts; although it helped certain models understand some code semantics, vulnerability detection requires multi-step reasoning, and these models still failed in steps, such as reasoning about variable relations. Our results suggest that new models, new training methods, or more execution-specific pretraining data may be needed to conquer vulnerability detection. We speculate that auto-regressive pre-training on source code may not effectively extract code semantics, especially on the current pretraining mixtures, in which execution data is scarce. Success on vulnerability detection as a code reasoning task can benefit many areas of software engineering such as debugging, test input generation, and program repair. Our code and data are available at https://doi.org/10.6084/m9.figshare.27368025.
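For readers unfamiliar with the metric, the sketch below shows how Balanced Accuracy is conventionally computed for a binary task: the mean of the true-positive rate and the true-negative rate. It is an illustration only, not the authors' evaluation code; the function name, the 1 = vulnerable / 0 = not vulnerable encoding, and the toy labels are assumptions made for this example.

```python
def balanced_accuracy(labels, predictions):
    """Mean of true-positive rate and true-negative rate for binary labels.

    Assumed encoding for this sketch: 1 = vulnerable, 0 = not vulnerable.
    """
    if len(labels) != len(predictions):
        raise ValueError("labels and predictions must have the same length")

    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)

    if tp + fn == 0 or tn + fp == 0:
        raise ValueError("both classes must appear in labels")

    tpr = tp / (tp + fn)  # recall on vulnerable examples
    tnr = tn / (tn + fp)  # recall on non-vulnerable examples
    return (tpr + tnr) / 2


# Toy data (hypothetical): four vulnerable and four patched functions.
labels = [1, 1, 1, 1, 0, 0, 0, 0]

# A model that flags everything as vulnerable scores exactly 0.5.
always_vulnerable = [1] * len(labels)
print(balanced_accuracy(labels, always_vulnerable))

# A model with some signal: TPR = 3/4, TNR = 2/4.
some_signal = [1, 1, 1, 0, 1, 1, 0, 0]
print(balanced_accuracy(labels, some_signal))
```

Because a constant predictor scores 0.5 under this metric regardless of class balance, a reported 54.5% sits only slightly above that trivial baseline.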

Paper Structure (17 sections, 12 figures, 8 tables)

Figures (12)

  • Figure 1: Examples of vulnerability detection as a complex code reasoning task. Diffed lines (+/-) show the lines changed to patch the vulnerability.
  • Figure 2: Vulnerability detection performance. Bar height shows the average performance across three random seeds, and error bars show standard deviations; stars mark the best-performing prompt for each model.
  • Figure 3: Error categories observed in responses from all LLMs. Bar width shows the number of responses that contained the category of error. One response can contain more than one type of error.
  • Figure 4: Missed Bounds/NULL check.
  • Figure 5: Misunderstood arithmetic operation.
  • ...and 7 more figures