MegaVul: A C/C++ Vulnerability Dataset with Comprehensive Code Representation
Chao Ni, Liyu Shen, Xiaohu Yang, Yan Zhu, Shaohua Wang
TL;DR
MegaVul addresses the need for large-scale, realistic C/C++ vulnerability datasets by linking CVE entries to vulnerability-fixing commits across 28 Git platforms and extracting complete function-level code with Tree-sitter. Each function is provided in four representations (signature, abstracted, parsed AST/PDG, and code changes) together with metadata such as CWE types and CVE descriptions, which supports both graph-based and sequence-based learning. The dataset contains 17,380 vulnerable functions and 322,168 non-vulnerable functions from 992 repositories, covering 169 CWE types disclosed between 2006 and 2023. It is publicly available and continuously updated. With real-world provenance and multiple code representations, MegaVul serves as a benchmark for training and evaluating data-driven vulnerability detectors and patch-identification methods, and it reflects both major open-source projects and the rising level of vulnerability activity in the wild.
Abstract
We construct a new large-scale, comprehensive C/C++ vulnerability dataset named MegaVul by crawling the Common Vulnerabilities and Exposures (CVE) database and CVE-related open-source projects. Specifically, we collect all crawlable descriptive information about the vulnerabilities from the CVE database and extract all vulnerability-related code changes from 28 Git-based websites. We adopt advanced tools to ensure the integrity of the extracted code and enrich the code with four different transformed representations. In total, MegaVul contains 17,380 vulnerabilities collected from 992 open-source repositories, spanning 169 different vulnerability types disclosed from January 2006 to October 2023. MegaVul can therefore be used for a variety of software security tasks, including vulnerability detection and vulnerability severity assessment. All information is stored in JSON format for ease of use. MegaVul is publicly available on GitHub and will be continuously updated. It can be easily extended to other programming languages.
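
Because the records are stored as JSON, they can be consumed with standard tooling. The following is a minimal sketch of counting vulnerable functions per vulnerability type. The field names (`cve_id`, `cwe_ids`, `is_vul`, `func`) and the inline sample records are assumptions made for illustration only; they are not the dataset's documented schema, and the CVE identifiers are placeholders.

```python
import json
from collections import Counter

# Hypothetical sample records. The field names and values are illustrative
# assumptions, not the actual MegaVul schema or real CVE entries.
SAMPLE_JSON = """
[
  {"cve_id": "CVE-0000-0001", "cwe_ids": ["CWE-119"], "is_vul": true,
   "func": "void f(char *s) { char b[8]; strcpy(b, s); }"},
  {"cve_id": "CVE-0000-0001", "cwe_ids": ["CWE-119"], "is_vul": false,
   "func": "void f(char *s) { char b[8]; strncpy(b, s, 7); b[7] = 0; }"},
  {"cve_id": "CVE-0000-0002", "cwe_ids": ["CWE-416"], "is_vul": true,
   "func": "void g(int *p) { free(p); *p = 0; }"}
]
"""


def load_records(text):
    """Parse a JSON array of function-level records into a list of dicts."""
    return json.loads(text)


def count_vulnerable_by_cwe(records):
    """Count vulnerable functions per CWE type, skipping non-vulnerable ones."""
    counts = Counter()
    for rec in records:
        if rec["is_vul"]:
            counts.update(rec["cwe_ids"])
    return counts


records = load_records(SAMPLE_JSON)
cwe_counts = count_vulnerable_by_cwe(records)

# Print each CWE type with its number of vulnerable functions.
for cwe, n in sorted(cwe_counts.items()):
    print(cwe, n)
```

With the real dataset, `SAMPLE_JSON` would be replaced by the contents of the released file, and the keys would need to be adjusted to the published field names.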
