A Manually-Curated Dataset of Fixes to Vulnerabilities of Open-Source Software

Serena E. Ponta, Henrik Plate, Antonino Sabetta, Michele Bezzi, Cédric Dangremont

TL;DR

The paper presents a manually curated, code-centric dataset mapping OSS vulnerability disclosures to fix commits, addressing limitations of metadata-based approaches. It combines data from the NVD and project-specific advisories to map 624 vulnerabilities across 205 Java OSS projects onto 1282 fix commits, and releases the data as an Apache-2.0 CSV with companion scripts. The work demonstrates the dataset's utility by using it to train classifiers that identify security-relevant commits, and by supporting analyses of fix-to-release delays, deduplication, and negative sampling for ML tasks. The authors advocate for industrially relevant, community-maintained data and outline a path towards expanding language coverage and collaborative curation.

Abstract

Advancing our understanding of software vulnerabilities, automating their identification, the analysis of their impact, and ultimately their mitigation is necessary to enable the development of software that is more secure. While operating a vulnerability assessment tool that we developed and that is currently used by hundreds of development units at SAP, we manually collected and curated a dataset of vulnerabilities of open-source software and the commits fixing them. The data was obtained both from the National Vulnerability Database (NVD) and from project-specific Web resources that we monitor on a continuous basis. From that data, we extracted a dataset that maps 624 publicly disclosed vulnerabilities affecting 205 distinct open-source Java projects, used in SAP products or internal tools, onto the 1282 commits that fix them. Out of 624 vulnerabilities, 29 do not have a CVE identifier at all and 46, which do have a CVE identifier assigned by a numbering authority, are not available in the NVD yet. The dataset is released under an open-source license, together with supporting scripts that allow researchers to automatically retrieve the actual content of the commits from the corresponding repositories and to augment the attributes available for each instance. These scripts also make it possible to complement the dataset with additional instances that are not security fixes (which is useful, for example, in machine learning applications). Our dataset has been successfully used to train classifiers that could automatically identify security-relevant commits in code repositories. The release of this dataset and the supporting code as open source will allow future research to be based on data of industrial relevance; it also represents a concrete step towards making the maintenance of this dataset a shared effort involving open-source communities, academia, and industry.

Paper Structure

This paper contains 5 sections, 1 equation, 4 figures, 1 table.

Figures (4)

  • Figure 1: Number of vulnerabilities per number of commits performed for fixing them (note: the y-axis uses a logarithmic scale).
  • Figure 2: Number of vulnerabilities per year.
  • Figure 3: Number of repositories per number of vulnerabilities (note: the y-axis uses a logarithmic scale).
  • Figure 4: Number of days from fix commit to release (note: the y-axis uses a logarithmic scale).
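
The distribution summarized in Figure 1 (vulnerabilities per number of fix commits) can be derived directly from a vulnerability-to-commit mapping. The following is a minimal sketch, not the authors' scripts: it assumes each CSV row holds a vulnerability identifier, a repository URL, and a commit identifier in that order, and it uses made-up sample rows (`VULN-A`, `example.org`, and the commit ids are all hypothetical).

```python
import csv
import io
from collections import Counter, defaultdict

# Hypothetical sample rows. The column order (vulnerability id,
# repository URL, commit id) is an assumption for illustration only.
SAMPLE_CSV = """\
VULN-A,https://example.org/repo1,aaa111
VULN-A,https://example.org/repo1,bbb222
VULN-B,https://example.org/repo2,ccc333
VULN-C,https://example.org/repo2,ddd444
"""


def commits_per_vulnerability(csv_text):
    """Return {vulnerability id: number of distinct fix commits}."""
    commits = defaultdict(set)
    for row in csv.reader(io.StringIO(csv_text)):
        if len(row) < 3:
            continue  # skip blank or malformed rows
        vuln_id, repo_url, commit_id = (field.strip() for field in row[:3])
        # A set deduplicates repeated (repository, commit) pairs.
        commits[vuln_id].add((repo_url, commit_id))
    return {vuln_id: len(pairs) for vuln_id, pairs in commits.items()}


def vulnerabilities_per_commit_count(csv_text):
    """Return a Counter: {number of fix commits: number of vulnerabilities}."""
    return Counter(commits_per_vulnerability(csv_text).values())


if __name__ == "__main__":
    histogram = vulnerabilities_per_commit_count(SAMPLE_CSV)
    for n_commits in sorted(histogram):
        print(n_commits, histogram[n_commits])
```

To apply the same idea to the released CSV, one would read the file contents into a string and pass it to `vulnerabilities_per_commit_count`, after checking the actual column layout of the file.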