DiverseVul: A New Vulnerable Source Code Dataset for Deep Learning Based Vulnerability Detection

Yizheng Chen, Zhoujie Ding, Lamya Alowain, Xinyun Chen, David Wagner

TL;DR

DiverseVul provides a large, real-world dataset of vulnerable and non-vulnerable C/C++ functions for benchmarking deep learning-based vulnerability detection. The study compares 11 architectures across four model families (GNNs and three families of LLMs) trained on CVEFixes, on a merged set of previous datasets, and on previous datasets plus DiverseVul. Large language models outperform GNNs when trained on enough data, particularly models with code-specific pretraining (CodeT5, NatGen). Generalization to unseen projects remains a key challenge, with notable performance gaps and label-noise issues, though class weights in the training loss offer some improvement. The work emphasizes data diversity, code-focused pretraining, and practical deployment hurdles, and releases DiverseVul to spur further research in robust, scalable vulnerability detection.

Abstract

We propose and release a new vulnerable source code dataset. We curate the dataset by crawling security issue websites and extracting vulnerability-fixing commits and source code from the corresponding projects. Our new dataset contains 18,945 vulnerable functions spanning 150 CWEs and 330,492 non-vulnerable functions extracted from 7,514 commits. Our dataset covers 295 more projects than all previous datasets combined. Combining our new dataset with previous datasets, we present an analysis of the challenges and promising research directions of using deep learning for detecting software vulnerabilities. We study 11 model architectures belonging to 4 families. Our results show that deep learning is still not ready for vulnerability detection, due to high false positive rates, low F1 scores, and the difficulty of detecting hard CWEs. In particular, we demonstrate an important generalization challenge for the deployment of deep learning-based models. We show that increasing the volume of training data may not further improve the performance of deep learning models for vulnerability detection, but might be useful for improving generalization to unseen projects. We also identify hopeful future research directions. We demonstrate that large language models (LLMs) are a promising research direction for ML-based vulnerability detection, outperforming Graph Neural Networks (GNNs) with code-structure features in our experiments. Moreover, developing source-code-specific pre-training objectives is a promising research direction for improving vulnerability detection performance.
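The abstract judges models by false positive rate and F1 score. As a reminder of how these two metrics relate to raw prediction counts, here is a minimal sketch in plain Python. The function name `detection_metrics` and the toy labels are illustrative assumptions, not part of the paper's evaluation code; label 1 is taken to mean "vulnerable".

```python
def detection_metrics(labels, predictions):
    """Compute precision, recall, F1, and false positive rate.

    Both arguments are sequences of 0/1 values, where 1 means "vulnerable".
    This is an illustrative helper, not the paper's evaluation code.
    """
    pairs = list(zip(labels, predictions))
    tp = sum(1 for y, p in pairs if y == 1 and p == 1)  # vulnerable, flagged
    fp = sum(1 for y, p in pairs if y == 0 and p == 1)  # benign, flagged
    fn = sum(1 for y, p in pairs if y == 1 and p == 0)  # vulnerable, missed
    tn = sum(1 for y, p in pairs if y == 0 and p == 0)  # benign, passed

    # Guard every ratio against an empty denominator.
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if (precision + recall) else 0.0)
    false_positive_rate = fp / (fp + tn) if (fp + tn) else 0.0

    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "false_positive_rate": false_positive_rate,
    }


# Toy example: 2 vulnerable and 4 non-vulnerable functions.
toy_labels = [1, 1, 0, 0, 0, 0]
toy_predictions = [1, 0, 1, 0, 0, 0]
toy_metrics = detection_metrics(toy_labels, toy_predictions)
```

Because non-vulnerable functions greatly outnumber vulnerable ones in the dataset (330,492 versus 18,945), even a small false positive rate can produce many false alarms relative to true detections, which is one reason both metrics are reported.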

Paper Structure (29 sections, 4 figures, 8 tables)

This paper contains 29 sections, 4 figures, 8 tables.

Figures (4)

  • Figure 1: An overview of several of our results. When trained on only the CVEFixes dataset, ReVeal performs comparably to large language models. With enough data (Previous + DiverseVul), large language models (e.g., NatGen) are superior to previous-generation models (e.g., ReVeal, a GNN model with code-structure features), but large datasets are needed to see these benefits. LLMs are better able to take advantage of larger datasets than previous-generation models (blue bars vs gray bars). The best LLMs for this task, CodeT5 and NatGen, have been pre-trained with code-specific tasks.
  • Figure 2: We visualize the performance of models that are trained on CVEFixes, Previous, and Previous + DiverseVul. Adding DiverseVul to the merged Previous dataset helps improve the test performance for 7 models out of 11. It does not help the CodeT5 models.
  • Figure 3: Deep learning for vulnerable source code detection benefits from more data collected from the same distribution as the test data. We fine-tune CodeT5 Small models on varying amounts of vulnerable source code data and report the test F1 score. We run each dataset setup 10 times. The lines show the average, and the region denotes the 95% confidence interval. This figure shows that a larger training set improves the F1 score for vulnerability detection on test data from the same distribution.
  • Figure 4: Using class weights in the training loss function improves the generalization performance over unseen projects for CodeT5 Small, and it slightly improves the performance on seen projects as well. The test F1 score on unseen projects is still quite low.
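Figure 4 refers to class weights in the training loss. The listing here does not state which weighting the authors used, so the following is only a sketch of one common variant: a binary cross-entropy in which each example is scaled by a per-class weight and the sum is normalized by the total weight. The names `weighted_bce` and `inverse_frequency_weight`, and the choice of inverse class frequency as the positive weight, are assumptions for illustration.

```python
import math


def inverse_frequency_weight(num_positive, num_negative):
    """Assumed weighting rule: upweight the rare class by the class ratio."""
    return num_negative / num_positive


def weighted_bce(labels, probs, pos_weight, neg_weight=1.0, eps=1e-12):
    """Class-weighted binary cross-entropy, normalized by the total weight.

    labels: 0/1 ground truth (1 = vulnerable).
    probs:  predicted probability of the vulnerable class.
    This is a hypothetical sketch, not the paper's training code.
    """
    total_loss = 0.0
    total_weight = 0.0
    for y, p in zip(labels, probs):
        p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
        weight = pos_weight if y == 1 else neg_weight
        total_loss += -weight * (y * math.log(p) + (1 - y) * math.log(1.0 - p))
        total_weight += weight
    return total_loss / total_weight


# Illustrative weight from the dataset-level counts quoted in the abstract.
# A real training run would use the counts of its own training split.
pos_weight = inverse_frequency_weight(18_945, 330_492)

# One vulnerable and one non-vulnerable function, both scored 0.9:
# the benign example is badly wrong, the vulnerable one is nearly right.
unweighted = weighted_bce([1, 0], [0.9, 0.9], pos_weight=1.0)
weighted = weighted_bce([1, 0], [0.9, 0.9], pos_weight=pos_weight)
```

With the larger positive weight, the normalized loss is dominated by the vulnerable example, so mistakes on the rare class cost more than mistakes on the common class. This is the general mechanism by which class weighting counteracts label imbalance.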