VulZoo: A Comprehensive Vulnerability Intelligence Dataset
Bonan Ruan, Jiahao Liu, Weibo Zhao, Zhenkai Liang
TL;DR
The paper tackles fragmentation in vulnerability intelligence by introducing VulZoo, a 6 GB dataset that integrates 17 data sources into a unified structure. It outlines a data collection pipeline with four crawling strategies, deduplication, and JSON normalization, and it defines 11 cross-source relationships that form a topology graph for rich vulnerability profiling. The dataset spans five content categories (CVE Records, Assessments, PoCs, Mails, Patches), drawing on MITRE, NVD, ZDI, GitHub, KEV, CWE, CAPEC, ATT&CK, D3FEND, AttackerKB, Exploit-DB, and mailing lists, and includes thousands of PoCs and patches. Three application scenarios (severity/type prediction, intelligence alignment, and information augmentation) demonstrate VulZoo’s practical utility. The authors provide public scripts and access to the dataset, which support incremental updates for ongoing vulnerability assessment and prioritization research.
Abstract
Software vulnerabilities pose critical security and risk concerns for many software systems. Many techniques have been proposed to assess and prioritize these vulnerabilities before they cause serious consequences. To evaluate their performance, these solutions often craft their own experimental datasets from a limited set of information sources, such as MITRE CVE and NVD, and therefore lack a global overview of broad vulnerability intelligence. The repetitive data preparation process further complicates the verification and comparison of new solutions. To address this issue, we propose VulZoo, a comprehensive vulnerability intelligence dataset that covers 17 popular vulnerability information sources. We also construct connections among these sources, enabling more straightforward configuration and adaptation for different vulnerability assessment tasks (e.g., vulnerability type prediction). Additionally, VulZoo provides utility scripts for automatic data synchronization and cleaning, relationship mining, and statistics generation. We make VulZoo publicly available and maintain it with incremental updates to facilitate future research. We believe that VulZoo serves as a valuable input to vulnerability assessment and prioritization studies. The dataset with utility scripts is available at https://github.com/NUS-Curiosity/VulZoo.
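To illustrate what constructing connections among sources can look like, the following is a minimal Python sketch that links records from different sources by the CVE IDs they mention. It is not VulZoo's actual relationship-mining code: the record fields (`source`, `id`, `text`), the function names, and the sample records are assumptions made for illustration only.

```python
import re
from collections import defaultdict

# CVE IDs have the form CVE-<year>-<4 or more digits>.
CVE_RE = re.compile(r"CVE-\d{4}-\d{4,}", re.IGNORECASE)


def extract_cve_ids(text):
    """Return the sorted, upper-cased set of CVE IDs mentioned in text."""
    return sorted({match.upper() for match in CVE_RE.findall(text)})


def link_records(records):
    """Map each CVE ID to the (source, record id) pairs that mention it.

    Each record is a dict with hypothetical keys: 'source', 'id', 'text'.
    """
    index = defaultdict(set)
    for rec in records:
        for cve in extract_cve_ids(rec["text"]):
            index[cve].add((rec["source"], rec["id"]))
    return {cve: sorted(refs) for cve, refs in index.items()}


# Illustrative sample records; the non-CVE identifiers are made up.
records = [
    {"source": "nvd", "id": "CVE-2021-44228",
     "text": "CVE-2021-44228: JNDI lookup issue in a logging library"},
    {"source": "exploit-db", "id": "EDB-0001",
     "text": "PoC for cve-2021-44228"},
    {"source": "mailing-list", "id": "msg-42",
     "text": "Discussion of CVE-2021-44228 and CVE-2021-45046"},
]

links = link_records(records)
```

Keying the index on a normalized CVE ID lets records that never reference each other directly (for example, a PoC and a mailing-list thread) be joined through the vulnerability they share.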
