Data and Context Matter: Towards Generalizing AI-based Software Vulnerability Detection

Rijha Safdar, Danyail Mateen, Syed Taha Ali, M. Umer Ashfaq, Wajahat Hussain

TL;DR

The paper addresses the poor generalization of AI-based vulnerability detection across unseen codebases and investigates how model architecture, training data quality, and context length impact performance. It introduces VulGate, a rigorously curated, diverse dataset with hard negatives and rich metadata, designed to improve generalization and support multiple tasks like localization and patching. Through extensive experiments, encoder-only models with larger context windows (notably UniXcoder-Base-Nine) outperform decoder-based models and static analyzers, achieving strong cross-project generalization and substantial recall gains (e.g., 6.8% absolute recall improvement on BigVul). The work demonstrates that data quality and model choice are critical for robust vulnerability detection, and it provides a practical dataset and benchmarking framework to advance cross-project effectiveness and future multi-language extensions.

Abstract

AI-based solutions demonstrate remarkable results in identifying vulnerabilities in software, but research has consistently found that this performance does not generalize to unseen codebases. In this paper, we specifically investigate the impact of model architecture, parameter configuration, and quality of training data on the ability of these systems to generalize. For this purpose, we introduce VulGate, a high-quality, state-of-the-art dataset that mitigates the shortcomings of prior datasets by removing mislabeled and duplicate samples, adding recent vulnerabilities, incorporating additional metadata, integrating hard samples, and including dedicated test sets. We undertake a series of experiments to demonstrate that improved dataset diversity and quality substantially enhance vulnerability detection. We also introduce and benchmark multiple encoder-only and decoder-only models. We find that encoder-based models outperform other models in terms of accuracy and generalization. Our model achieves a 6.8% improvement in recall on the benchmark BigVul dataset and outperforms others on unseen projects, demonstrating enhanced generalizability. Our results highlight the role of data quality and model selection in the development of robust vulnerability detection systems. Our findings suggest a direction for future systems with high cross-project effectiveness.

Paper Structure

This paper contains 17 sections, 2 figures, 6 tables.

Figures (2)

  • Figure 1: Automated end-to-end data collection pipeline for VulGate. The workflow integrates CVE/CWE records, GitHub commits, and function-level parsing to build a structured vulnerability dataset.
  • Figure 2: A visual example of a vulnerability (CVE-2021-40568) correctly identified by UniXcoder-Base-Nine: vulnerable lines in svc_parse_slice. The check on pps_id is insufficient, leading to a potential buffer overflow.