Reasoning with LLMs for Zero-Shot Vulnerability Detection

Arastoo Zibaeirad; Marco Vieira

TL;DR

This paper addresses the challenge of zero-shot vulnerability detection in real-world software by introducing VulnSage, a rigorous evaluation framework and dataset built from large-scale C/C++ projects. It couples a multi-granular code representation with a two-stage noise mitigation pipeline and four zero-shot prompting strategies (Baseline, CoT, Think, Think & Verify) to probe LLM reasoning in vulnerability detection and patch verification. The key findings show that structured reasoning, especially Think & Verify, reduces ambiguity and boosts accuracy, and that code-specialized models consistently outperform general-purpose ones, though no single approach excels across all CWE types or granularities. The VulnSage framework provides a publicly available, extensible benchmark that better reflects real-world security challenges and helps advance robust, reasoning-driven software vulnerability detection (SVD) tools.

Abstract

Automating software vulnerability detection (SVD) remains a critical challenge in an era of increasingly complex and interdependent software systems. Despite significant advances in Large Language Models (LLMs) for code analysis, prevailing evaluation methodologies often lack the context-aware robustness necessary to capture real-world intricacies and cross-component interactions. To address these limitations, we present VulnSage, a comprehensive evaluation framework and a dataset curated from diverse, large-scale open-source system software projects developed in C/C++. Unlike prior datasets, it leverages a heuristic noise pre-filtering approach combined with LLM-based reasoning to ensure a representative and minimally noisy spectrum of vulnerabilities. The framework supports multi-granular analysis across function, file, and inter-function levels and employs four diverse zero-shot prompt strategies: Baseline, Chain-of-Thought, Think, and Think & Verify. Through this evaluation, we uncover that structured reasoning prompts substantially improve LLM performance, with Think & Verify reducing ambiguous responses from 20.3% to 9.1% while increasing accuracy. We further demonstrate that code-specialized models consistently outperform general-purpose alternatives, with performance varying significantly across vulnerability types, revealing that no single approach universally excels across all security contexts. Link to dataset and codes: https://github.com/Erroristotle/VulnSage.git
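
The abstract reports two quantities for each prompt strategy: the share of ambiguous responses and the accuracy. The following is a minimal sketch of how such metrics could be computed from raw model outputs, not the VulnSage implementation. It assumes a hypothetical answer format in which the model closes its response with a line `ANSWER: YES` or `ANSWER: NO`, treats any response without exactly one consistent verdict as ambiguous, and counts ambiguous responses as incorrect. The names `classify_response` and `summarize` are illustrative only.

```python
import re

# Hypothetical answer format (an assumption, not taken from the paper):
# the model is asked to end its response with "ANSWER: YES" or "ANSWER: NO".
ANSWER_RE = re.compile(r"^\s*answer\s*:\s*(yes|no)\s*$", re.IGNORECASE | re.MULTILINE)


def classify_response(text):
    """Map a free-text model response to a verdict label."""
    verdicts = {match.lower() for match in ANSWER_RE.findall(text)}
    if verdicts == {"yes"}:
        return "vulnerable"
    if verdicts == {"no"}:
        return "not_vulnerable"
    # No verdict line, or contradictory verdict lines.
    return "ambiguous"


def summarize(responses, labels):
    """Compute ambiguity rate and accuracy over a set of responses.

    labels[i] is True when sample i is vulnerable in the ground truth.
    Ambiguous responses are counted as incorrect (an assumption).
    """
    if len(responses) != len(labels):
        raise ValueError("responses and labels must have the same length")
    if not responses:
        raise ValueError("at least one response is required")

    ambiguous = 0
    correct = 0
    for text, is_vulnerable in zip(responses, labels):
        verdict = classify_response(text)
        if verdict == "ambiguous":
            ambiguous += 1
        elif (verdict == "vulnerable") == is_vulnerable:
            correct += 1

    total = len(responses)
    return {"ambiguous_rate": ambiguous / total, "accuracy": correct / total}
```

Running `summarize` once per prompt strategy over the same samples would give directly comparable ambiguity rates, which is the kind of comparison behind the reported drop from 20.3% to 9.1%.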
