Out of Distribution, Out of Luck: How Well Can LLMs Trained on Vulnerability Datasets Detect Top 25 CWE Weaknesses?

Yikun Li, Ngoc Tan Bui, Ting Zhang, Chengran Yang, Xin Zhou, Martin Weyssow, Jinfeng Jiang, Junkai Chen, Huihui Huang, Huu Hung Nguyen, Chiok Yew Ho, Jie Tan, Ruiyin Li, Yide Yin, Han Wei Ang, Frank Liauw, Eng Lieh Ouh, Lwin Khin Shar, David Lo

TL;DR

This work addresses the limited real-world impact of ML-based vulnerability detection by revealing pervasive data quality issues in public vulnerability datasets and the resulting generalization gap between In-Distribution and Out-of-Distribution performance. It introduces BenchVul, a manually verified benchmark for MITRE's Top 25 CWEs, TitanVul, a large-scale high-quality training dataset, and RVG, a multi-agent LLM-based vulnerability synthesis pipeline. Empirical results show that models trained on TitanVul generalize far better to BenchVul's real-world vulnerabilities than models trained on larger, noisier datasets, and that augmenting TitanVul with RVG data further boosts OOD performance, especially for underrepresented CWEs. Together, these contributions provide reliable evaluation resources and practical guidance for building models that generalize to real-world vulnerability detection tasks, with implications for data curation and synthetic data generation in security-sensitive AI systems.

Abstract

Automated vulnerability detection research has made substantial progress, yet its real-world impact remains limited. Prior work found that current vulnerability datasets suffer from label inaccuracy rates of 20%-71%, extensive duplication, and poor coverage of critical Common Weakness Enumeration (CWE) types. These issues create a significant generalization gap: models achieve misleadingly high In-Distribution (ID) accuracy (measured on test splits of the same dataset they were trained on) by exploiting spurious correlations rather than learning true vulnerability patterns. To address these limitations, we present a three-part solution. First, we introduce BenchVul, a manually curated and balanced test dataset covering the MITRE Top 25 Most Dangerous CWEs, to enable fair model evaluation. Second, we construct TitanVul, a high-quality training dataset of 38,548 functions, by aggregating seven public sources and applying deduplication and validation with a novel multi-agent LLM pipeline. Third, we propose a Realistic Vulnerability Generation (RVG) pipeline, which synthesizes context-aware vulnerability examples for underrepresented but critical CWE types through simulated development workflows. Our evaluation reveals that ID performance does not reliably predict Out-of-Distribution (OOD) performance on BenchVul. For example, a model trained on BigVul achieves the highest ID accuracy (0.703) but fails on BenchVul's real-world samples (0.493 OOD accuracy). Conversely, a model trained on TitanVul achieves the highest OOD performance on both the real-world (0.881) and synthesized (0.785) portions of BenchVul, improving upon the next-best dataset by 5.3% and 11.8% respectively, despite a modest ID score (0.590). Augmenting TitanVul with RVG data further boosts this leading OOD performance, improving accuracy on real-world data by 5.8% (to 0.932).
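
The 5.8% figure in the last sentence appears to be a relative improvement rather than a gain in percentage points. The following minimal sketch checks this using only the two accuracies quoted in the abstract; the function and variable names are illustrative and do not come from the paper.

```python
def relative_gain(new: float, old: float) -> float:
    """Relative improvement of `new` over `old`, expressed as a percentage."""
    return (new - old) / old * 100.0


# Real-world BenchVul accuracies quoted in the abstract.
titanvul_real = 0.881       # trained on TitanVul alone
titanvul_rvg_real = 0.932   # trained on TitanVul augmented with RVG data

gain = relative_gain(titanvul_rvg_real, titanvul_real)
points = (titanvul_rvg_real - titanvul_real) * 100.0

print(f"relative gain: {gain:.1f}%")         # matches the reported 5.8%
print(f"absolute gain: {points:.1f} points")
```

Under this reading, the absolute difference is about 5.1 accuracy points, while the relative gain over 0.881 is about 5.8%.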

Paper Structure

This paper contains 24 sections, 10 figures, 5 tables.

Figures (10)

  • Figure 1: Distribution of CWE types across six major vulnerability datasets.
  • Figure 2: Vulnerability dataset duplication matrix.
  • Figure 3: Distribution of MITRE top 25 most dangerous CWE across the consolidated vulnerability dataset.
  • Figure 4: Overview of the BenchVul construction pipeline for the MITRE Top 25 Most Dangerous CWEs.
  • Figure 5: Heatmap of similarity scores between vulnerability datasets, including BenchVul "Real" and "Synth" data.
  • ...and 5 more figures