Specification-Guided Vulnerability Detection with Large Language Models

Hao Zhu, Jia Li, Cuiyun Gao, Jiaru Qian, Yihong Dong, Huanyu Liu, Lecheng Wang, Ziliang Wang, Xiaolong Hu, Ge Li

TL;DR

VulInstruct addresses the weak security reasoning of large language models (LLMs) in vulnerability detection by mining explicit security specifications from historical vulnerabilities. It builds a Specification Knowledge Base that combines general specifications, extracted from high-quality patches across projects, with domain-specific specifications, derived from CVE data for repositories relevant to the target code. Dual retrieval and specification-guided reasoning then align the LLM's analysis with expected safe behavior. On PrimeVul under the CORRECT evaluation framework, VulInstruct achieves state-of-the-art performance (45.0% F1, 37.7% recall), uniquely detects 24.3% of vulnerabilities, and demonstrates real-world utility by uncovering a previously unknown high-severity vulnerability (CVE-2025-56538). The results show that grounding LLM reasoning in structured security knowledge improves root-cause understanding, generalization across vulnerability types and models, and practical vulnerability discovery in real software projects.

Abstract

Large language models (LLMs) have achieved remarkable progress in code understanding tasks. However, they demonstrate limited performance in vulnerability detection and struggle to distinguish vulnerable code from patched code. We argue that LLMs lack understanding of security specifications -- the expectations about how code should behave to remain safe. When code behavior differs from these expectations, it becomes a potential vulnerability. However, such knowledge is rarely explicit in training data, leaving models unable to reason about security flaws. We propose VulInstruct, a specification-guided approach that systematically extracts security specifications from historical vulnerabilities to detect new ones. VulInstruct constructs a specification knowledge base from two perspectives: (i) General specifications from high-quality patches across projects, capturing fundamental safe behaviors; and (ii) Domain-specific specifications from repeated violations in particular repositories relevant to the target code. VulInstruct retrieves relevant past cases and specifications, enabling LLMs to reason about expected safe behaviors rather than relying on surface patterns. We evaluate VulInstruct under strict criteria requiring both correct predictions and valid reasoning. On PrimeVul, VulInstruct achieves 45.0% F1-score (32.7% improvement) and 37.7% recall (50.8% improvement) compared to baselines, while uniquely detecting 24.3% of vulnerabilities -- 2.4x more than any baseline. In pair-wise evaluation, VulInstruct achieves 32.3% relative improvement. VulInstruct also discovered a previously unknown high-severity vulnerability (CVE-2025-56538) in production code, demonstrating practical value for real-world vulnerability discovery. All code and supplementary materials are available at https://github.com/zhuhaopku/VulInstruct-temp.
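
To make the retrieval idea concrete, the following is a minimal sketch of how general and domain-specific specifications might be retrieved and placed in a prompt. It is not the authors' implementation: the `Spec` record, the token-overlap (Jaccard) scoring, the function names, and the example specification texts are all assumptions chosen for illustration, and a real system would likely use a learned retriever instead.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Spec:
    """Hypothetical knowledge-base entry (not the paper's schema)."""
    text: str      # statement of an expected safe behavior
    scope: str     # "general" or "domain"
    project: str   # originating repository; only meaningful for domain specs


def tokens(s):
    """Lowercased word-like tokens; a crude stand-in for a real embedding."""
    return set(re.findall(r"[a-z_]+", s.lower()))


def jaccard(a, b):
    """Token-overlap similarity in [0, 1] (an assumed scoring function)."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 0.0


def dual_retrieve(code, project, kb, k_general=2, k_domain=2):
    """Rank general specs across all projects and domain specs for one project."""
    def top(candidates, k):
        ranked = sorted(candidates, key=lambda s: jaccard(code, s.text), reverse=True)
        return ranked[:k]

    general = top([s for s in kb if s.scope == "general"], k_general)
    domain = top([s for s in kb if s.scope == "domain" and s.project == project], k_domain)
    return general, domain


def build_prompt(code, general, domain):
    """Assemble a prompt that asks the model to reason about expected safe behavior."""
    lines = ["Expected safe behaviors (general):"]
    lines += [f"- {s.text}" for s in general]
    lines.append("Expected safe behaviors (project-specific):")
    lines += [f"- {s.text}" for s in domain]
    lines.append("Does the code below violate any of them? Explain the root cause.")
    lines.append(code)
    return "\n".join(lines)


# Illustrative entries only; the wording is invented for this sketch.
KB = [
    Spec("validate the length field before memcpy into the buffer", "general", ""),
    Spec("verify the certificate hostname before trusting a tls connection", "general", ""),
    Spec("check the length read from the image header before allocating", "domain", "ImageMagick"),
    Spec("verify the hostname during the ssl handshake", "domain", "libgadu"),
]

target = "memcpy(buffer, src, length);"
general, domain = dual_retrieve(target, "ImageMagick", KB, k_general=1, k_domain=2)
prompt = build_prompt(target, general, domain)
```

Keeping the two retrieval channels separate means a project with few recorded vulnerabilities still receives general specifications, while a project with recurring flaws also receives its own history.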

Paper Structure

This paper contains 32 sections, 3 equations, 6 figures, 8 tables.

Figures (6)

  • Figure 1: (a) CVE-2013-4488 vulnerability analysis: the vulnerable code path in libgadu, showing missing hostname verification, and VulInstruct's automated specification extraction from the patch. (b) VulInstruct using specifications from historical cases to detect a new vulnerability.
  • Figure 2: Domain-specific recurring exploitation mechanism in ImageMagick: unvalidated length fields across historical and new vulnerabilities.
  • Figure 3: Overview of VulInstruct
  • Figure 4: Knowledge Selection Threshold Ablation: The inverted-U relationship in RAG-based vulnerability detection. Spec represents general specifications, VulSpec denotes corresponding detailed vulnerability cases, and NVD indicates CVE cases (which are subsequently transformed into domain-specific specifications).
  • Figure 5: Comparison of two experimental results.
  • ...and 1 more figure