Beyond Dependencies: The Role of Copy-Based Reuse in Open Source Software Development

Mahmoud Jahanshahi, David Reid, Audris Mockus

TL;DR

This work quantifies copy-based reuse across the Open Source Software ecosystem using World of Code, revealing that about $6.9\%$ of blobs are reused and that at least $61\%$ of OSS projects engage in copy-based reuse, with notable contributions from medium- and small-sized projects. It introduces a blob-level, time-bounded measurement framework and applies regression analyses to identify blob and project characteristics that influence reuse probability, including language, binary content, and project age. The study also triangulates quantitative findings with a three-round developer survey to capture motivations, perceived usefulness, and openness to tooling, finding that creators often intend reuse and reusers are generally positive but exhibit low concern for potential bugs. The results highlight language- and ecosystem-specific patterns, the significant role of binary artifacts, and the need for tailored tooling (security, licensing, and package management) to manage the OSS supply chain effectively. Overall, the paper establishes a scalable baseline for understanding and supporting copy-based reuse while outlining critical directions for future research and tooling development to mitigate associated risks.

Abstract

In Open Source Software, resources of any project are open for reuse by introducing dependencies or copying the resource itself. In contrast to dependency-based reuse, the infrastructure to systematically support copy-based reuse appears to be entirely missing. Our aim is to enable future research and tool development to increase efficiency and reduce the risks of copy-based reuse. We seek a better understanding of such reuse by measuring its prevalence and identifying factors affecting the propensity to reuse. To identify reused artifacts and trace their origins, our method exploits World of Code infrastructure. We begin with a set of theory-derived factors related to the propensity to reuse, sample instances of different reuse types, and survey developers to better understand their intentions. Our results indicate that copy-based reuse is common, with many developers being aware of it when writing code. The propensity for a file to be reused varies greatly among languages and between source code and binary files, and it decreases consistently over time. Files introduced by popular projects are more likely to be reused, but at least half of reused resources originate from "small" and "medium" projects. Developers had various reasons for reuse but were generally positive about using a package manager.
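
To make the idea of blob-level reuse identification concrete, the following is a minimal sketch, not the authors' pipeline. It relies on one well-known fact: Git names a blob by the SHA-1 of a `blob <size>\0` header followed by the file content, so identical files in different projects share one blob ID. Everything else is an assumption made for illustration: the `(project, commit_time, content)` input format, the function names, and the rule that the project with the earliest commit of a blob is treated as its originator.

```python
import hashlib
from collections import defaultdict


def git_blob_id(content: bytes) -> str:
    """Return the Git blob ID: SHA-1 over a 'blob <size>' header, a NUL byte, and the content."""
    header = b"blob " + str(len(content)).encode("ascii") + b"\0"
    return hashlib.sha1(header + content).hexdigest()


def find_reused_blobs(observations):
    """Group observations by blob ID and report blobs seen in more than one project.

    observations: iterable of (project, commit_time, content) tuples (hypothetical format).
    Returns {blob_id: (assumed_originator, [later_projects])}.
    """
    # blob_id -> {project: earliest commit time at which that project had the blob}
    first_seen = defaultdict(dict)
    for project, commit_time, content in observations:
        blob = git_blob_id(content)
        previous = first_seen[blob].get(project)
        if previous is None or commit_time < previous:
            first_seen[blob][project] = commit_time

    reused = {}
    for blob, projects in first_seen.items():
        if len(projects) < 2:
            continue  # a blob confined to one project is not copy-based reuse
        # Assumption: the earliest committer is the originator; ties break by project name.
        ordered = sorted(projects, key=lambda p: (projects[p], p))
        reused[blob] = (ordered[0], ordered[1:])
    return reused


if __name__ == "__main__":
    sample = [
        ("proj-a", 10, b"int main() { return 0; }\n"),
        ("proj-b", 20, b"int main() { return 0; }\n"),
        ("proj-c", 5, b"unrelated content\n"),
    ]
    for blob, (origin, reusers) in find_reused_blobs(sample).items():
        print(blob, origin, reusers)
```

Matching on the exact blob ID only captures verbatim copies of whole files; a file that is modified after copying gets a different ID and would not be counted by this sketch.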

Paper Structure

This paper contains 69 sections, 1 equation, 5 figures, and 18 tables.

Figures (5)

  • Figure 1: Reuse Identification Data Flow Diagram
  • Figure 2: Quarterly Reuse Trends
  • Figure 3: Blob-level Model - Logistic Regression Odds Ratios
  • Figure 4: Reused Blobs to Total Generated Blobs Ratio Trend in JavaScript
  • Figure 5: Project-level Model - Logistic Regression Odds Ratios