
ReposVul: A Repository-Level High-Quality Vulnerability Dataset

Xinchen Wang, Ruida Hu, Cuiyun Gao, Xin-Cheng Wen, Yujia Chen, Qing Liao

TL;DR

The paper tackles the poor data quality and limited context of existing OSS vulnerability datasets by proposing ReposVul, the first repository-level vulnerability dataset. It introduces a three-module automated framework: vulnerability untangling (LLMs plus static analysis to separate vulnerability-fixing changes from tangled patches), multi-granularity dependency extraction (capturing inter-procedural call relationships across repository, file, function, and line levels), and trace-based filtering (file-path and commit-time analysis to identify outdated patches). The dataset spans more than six thousand CVE entries across four languages with extensive CWE coverage and rich patch metadata. It is validated against baselines and manual labeling, showing superior label quality and practical utility for DL-based vulnerability detection and patch management. ReposVul is publicly released to support standardized evaluation and broader research on inter-procedural vulnerabilities and timely vulnerability repair in OSS ecosystems.

Abstract

Open-Source Software (OSS) vulnerabilities pose great challenges to software security and potential risks to society. Enormous efforts have been devoted to automated vulnerability detection, among which deep learning (DL)-based approaches have proven to be the most effective. However, the current labeled data present the following limitations: (1) Tangled Patches: Developers may submit code changes unrelated to vulnerability fixes within patches, leading to tangled patches. (2) Lack of Inter-procedural Vulnerabilities: Existing vulnerability datasets typically contain function-level and file-level vulnerabilities and ignore the relations between functions, leaving approaches unable to detect inter-procedural vulnerabilities. (3) Outdated Patches: Existing datasets usually contain outdated patches, which may bias the model during training. To address these limitations, we propose an automated data collection framework and construct the first repository-level high-quality vulnerability dataset, named ReposVul. The proposed framework contains three main modules: (1) A vulnerability untangling module, which distinguishes vulnerability-fixing code changes from tangled patches by jointly employing Large Language Models (LLMs) and static analysis tools. (2) A multi-granularity dependency extraction module, which captures the inter-procedural call relationships of vulnerabilities by constructing information at multiple granularities for each vulnerability patch: repository level, file level, function level, and line level. (3) A trace-based filtering module, which filters outdated patches using a file path trace-based filter and a commit time trace-based filter to construct an up-to-date dataset.
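
The idea behind trace-based filtering can be illustrated with a minimal sketch. The abstract does not spell out the filtering rules, so the rule used here is an assumption made purely for illustration: patches are grouped by file path, ordered by commit time, and every patch except the most recent one per file is flagged as outdated. The `Patch` record and the `flag_outdated` function are hypothetical names, not part of ReposVul.

```python
from dataclasses import dataclass
from datetime import datetime


@dataclass(frozen=True)
class Patch:
    """One vulnerability-fixing commit (hypothetical record layout)."""
    commit_id: str
    file_path: str
    committed_at: datetime


def flag_outdated(patches: list[Patch]) -> set[str]:
    """Return ids of patches superseded by a later patch to the same file.

    Assumed rule (not taken from the paper): group by file path, sort each
    group by commit time, and flag all but the newest patch in each group.
    """
    by_path: dict[str, list[Patch]] = {}
    for patch in patches:
        by_path.setdefault(patch.file_path, []).append(patch)

    outdated: set[str] = set()
    for group in by_path.values():
        group.sort(key=lambda p: p.committed_at)
        # Everything before the last element has a newer patch on the same path.
        outdated.update(p.commit_id for p in group[:-1])
    return outdated


if __name__ == "__main__":
    example = [
        Patch("a1", "src/parse.c", datetime(2020, 1, 5)),
        Patch("b2", "src/parse.c", datetime(2021, 3, 9)),
        Patch("c3", "src/net.c", datetime(2019, 7, 1)),
    ]
    print(flag_outdated(example))  # {'a1'}
```

In this toy input only `a1` is flagged, because `b2` later touches the same path, while `c3` is the sole patch for its file. A real filter would likely also need to follow file renames and moves, which this sketch ignores.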

Paper Structure

This paper contains 29 sections, 5 figures, 7 tables, and 2 algorithms.

Figures (5)

  • Figure 1: Examples for illustrating the challenges of existing datasets. Lines highlighted in green denote added content, red indicates deleted content, yellow represents commit information, and blue identifies the caller and callee.
  • Figure 2: The architecture of our automatic data collection framework.
  • Figure 3: A sample prompt for LLMs to evaluate the relevance of code changes in one file to the vulnerability fixes.
  • Figure 4: Two solutions to fix the inter-procedural vulnerability. The tokens highlighted in green indicate code changes related to vulnerability fixing.
  • Figure 5: Outdated patches about CWEs, time, projects, and programming languages.