Wild SBOMs: a Large-scale Dataset of Software Bills of Materials from Public Code

Luís Soeiro, Thomas Robert, Stefano Zacchiroli

TL;DR

This work tackles the lack of large-scale, real-world SBOM data by constructing a deduplicated dataset of 78,612 SBOM files mined from 94,618,356 public repositories. The dataset, built through a reproducible pipeline that leverages Software Heritage and the sbomqs scoring tool, includes metadata on SBOM standards, formats, provenance, and quality scores, enabling analyses of adoption, quality, and tooling in the wild. The authors demonstrate potential applications in SBOM practice assessment, tool benchmarking, software composition analysis, and vulnerability studies, and release the data openly with a replication package for future research. They also discuss limitations and outline future work to broaden coverage and enrich metadata from additional SBOM sources and tools.

Abstract

Developers gain productivity by reusing readily available Free and Open Source Software (FOSS) components. Such reuse also brings difficulties, such as managing licensing, components, and related security issues. One approach to handling those difficulties is to use Software Bills of Materials (SBOMs). While there have been studies on the readiness of practitioners to embrace SBOMs and on the SBOM tool ecosystem, a large-scale study of SBOM practices based on SBOM files produced in the wild is still lacking. A starting point for such a study is a large dataset of SBOM files found in the wild. We introduce such a dataset, consisting of over 78 thousand unique SBOM files, deduplicated from those found in over 94 million repositories. We include metadata covering the standard and format used, the quality score generated by the sbomqs tool, the number of revisions, filenames, and provenance information. Finally, we give suggestions and examples of research that could bring new insights into assessing and improving real-world SBOM practices.
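
To make the deduplication and standard-detection ideas concrete, here is a minimal Python sketch. It is not the authors' pipeline: the function names, the choice of SHA-256 as content digest, and the restriction to JSON-encoded SBOMs are all assumptions made for illustration. It relies only on two well-known markers: CycloneDX JSON documents carry a top-level "bomFormat" field set to "CycloneDX", and SPDX JSON documents carry a top-level "spdxVersion" field.

```python
import hashlib
import json


def detect_standard(text):
    """Guess the SBOM standard of a JSON document from its top-level keys.

    Hypothetical helper: handles JSON only; other encodings return None.
    """
    try:
        doc = json.loads(text)
    except json.JSONDecodeError:
        return None
    if not isinstance(doc, dict):
        return None
    if doc.get("bomFormat") == "CycloneDX":
        return "CycloneDX"
    if "spdxVersion" in doc:
        return "SPDX"
    return None


def deduplicate(files):
    """Group (filename, content) pairs by content digest.

    Returns a dict mapping digest -> {"standard": ..., "filenames": [...]},
    so identical content found under several names is stored once.
    SHA-256 is an assumption here, not the identifier used by the dataset.
    """
    unique = {}
    for name, text in files:
        digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
        record = unique.setdefault(
            digest, {"standard": detect_standard(text), "filenames": []}
        )
        record["filenames"].append(name)
    return unique
```

Calling `deduplicate` on a list of (filename, content) pairs yields one record per distinct content, which mirrors, in a simplified way, the idea of keeping unique SBOM files while retaining the filenames under which each one was found.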

Paper Structure

This paper contains 17 sections and 5 figures.

Figures (5)

  • Figure 1: Overview of the SBOM dataset creation
  • Figure 2: Relational model for the SBOM dataset metadata
  • Figure 3: Distribution of SBOM standards and file formats
  • Figure 4: The most popular filenames for SBOM files
  • Figure 5: Forges with the most SBOM files