Reasoning with LLMs for Zero-Shot Vulnerability Detection

Arastoo Zibaeirad, Marco Vieira

TL;DR

This paper addresses the challenge of zero-shot vulnerability detection in real-world software by introducing VulnSage, a rigorous evaluation framework and dataset built from large-scale C/C++ projects. It couples a multi-granular code representation with a two-stage noise mitigation pipeline and four zero-shot prompting strategies (Baseline, CoT, Think, Think & Verify) to probe LLM reasoning in vulnerability detection and patch verification. The key findings show that structured reasoning, especially Think & Verify, reduces ambiguity and boosts accuracy, while code-specialized models consistently outperform general-purpose ones, though no single approach excels across all CWE types or granularities. The VulnSage framework provides a publicly available, extensible benchmark that better reflects real-world security challenges and helps advance robust, reasoning-driven SVD tools.

Abstract

Automating software vulnerability detection (SVD) remains a critical challenge in an era of increasingly complex and interdependent software systems. Despite significant advances in Large Language Models (LLMs) for code analysis, prevailing evaluation methodologies often lack the context-aware robustness necessary to capture real-world intricacies and cross-component interactions. To address these limitations, we present VulnSage, a comprehensive evaluation framework and a dataset curated from diverse, large-scale open-source system software projects developed in C/C++. Unlike prior datasets, it leverages a heuristic noise pre-filtering approach combined with LLM-based reasoning to ensure a representative and minimally noisy spectrum of vulnerabilities. The framework supports multi-granular analysis across function, file, and inter-function levels and employs four diverse zero-shot prompt strategies: Baseline, Chain-of-Thought, Think, and Think & Verify. Through this evaluation, we find that structured reasoning prompts substantially improve LLM performance, with Think & Verify reducing ambiguous responses from 20.3% to 9.1% while increasing accuracy. We further demonstrate that code-specialized models consistently outperform general-purpose alternatives, with performance varying significantly across vulnerability types, revealing that no single approach universally excels across all security contexts. Link to dataset and code: https://github.com/Erroristotle/VulnSage.git
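
To put the reported drop in ambiguous responses in perspective, the short sketch below computes the absolute and relative change from the two percentages quoted in the abstract. The helper name `ambiguity_reduction` is purely illustrative and is not part of VulnSage; only the two input percentages come from the source.

```python
def ambiguity_reduction(before_pct: float, after_pct: float) -> tuple[float, float]:
    """Return (absolute drop in percentage points, relative drop in percent).

    Hypothetical helper for illustration; not taken from the VulnSage code.
    """
    if before_pct <= 0:
        raise ValueError("before_pct must be positive")
    absolute = before_pct - after_pct
    relative = 100.0 * absolute / before_pct
    return absolute, relative


# Figures quoted in the abstract: 20.3% -> 9.1% ambiguous responses.
absolute, relative = ambiguity_reduction(20.3, 9.1)
print(f"{absolute:.1f} percentage points, {relative:.1f}% relative reduction")
# prints "11.2 percentage points, 55.2% relative reduction"
```

In other words, ambiguous responses fall by 11.2 percentage points, which is a little more than half in relative terms.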

Paper Structure

This paper contains 21 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: VulnSage Architecture
  • Figure 2: Aggregated accuracy of vulnerability detection and patch verification for the top 10 most frequent CWEs in our dataset, which align with commonly occurring vulnerabilities in real-world C and C++ system projects. Each heatmap represents a different prompting strategy, showing how accuracy varies across CWE types.
  • Figure 3: Correlation between Noise and LLM Performance. The figure shows how different prompting strategies perform on vulnerability detection (solid lines) and patch verification (dotted lines) tasks as noise in the input code increases from 0% to 100%, with each line representing the mean accuracy across all evaluated models.
  • Figure 4: Impact of Granularity on LLM Performance Across Prompt Strategies. The figure shows vulnerability detection (solid bars) and patch verification (hatched bars) correctness. The y-axis represents aggregated accuracy, where stacked bars sum both tasks (e.g., 100% in each totals 200%).
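
The stacked-bar convention described for Figure 4 can be illustrated with a minimal sketch. It assumes each task's accuracy is a percentage in [0, 100]; the helper name `stacked_accuracy` is hypothetical, and the only input used is the 100% + 100% example from the caption itself, not a result from the paper.

```python
def stacked_accuracy(detection_pct: float, patch_pct: float) -> float:
    """Sum the two per-task accuracies, as in a stacked bar (maximum 200).

    Hypothetical helper for illustration; not taken from the VulnSage code.
    """
    for value in (detection_pct, patch_pct):
        if not 0.0 <= value <= 100.0:
            raise ValueError("accuracies must be percentages in [0, 100]")
    return detection_pct + patch_pct


# The caption's own example: 100% on each task stacks to 200%.
total = stacked_accuracy(100.0, 100.0)
print(total)        # prints 200.0
print(total / 2.0)  # mean per-task accuracy; prints 100.0
```

Halving the stacked value recovers the mean accuracy over the two tasks, which is easier to compare against single-task numbers.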