Software Dependencies 2.0: An Empirical Study of Reuse and Integration of Pre-Trained Models in Open-Source Projects

Jerin Yasmin, Wenxin Jiang, James C. Davis, Yuan Tian

TL;DR

This study investigates Software Dependencies 2.0 in open-source software (OSS) projects by examining the reuse of PTMs, with a focus on how developers manage and integrate these models.

Abstract

Pre-trained models (PTMs) are machine learning models that have been trained in advance, often on large-scale data, and can be reused for new tasks, thereby reducing the need for costly training from scratch. Their widespread adoption introduces a new class of software dependency, which we term Software Dependencies 2.0, extending beyond conventional libraries to learned behaviors embodied in trained models and their associated artifacts. How PTMs are integrated as software dependencies in real projects remains unclear, potentially threatening the maintainability and reliability of modern software systems that increasingly rely on them. Objective: In this study, we investigate Software Dependencies 2.0 in open-source software (OSS) projects by examining the reuse of PTMs, with a focus on how developers manage and integrate these models. Specifically, we seek to understand: (1) how OSS projects structure and document their PTM dependencies; (2) what stages and organizational patterns emerge in the reuse pipelines of PTMs within these projects; and (3) the interactions among PTMs and other learned components across pipeline stages. We conduct a mixed-methods analysis of a statistically significant random sample of 401 GitHub repositories from the PeaTMOSS dataset (28,575 repositories reusing PTMs from Hugging Face and PyTorch Hub). We quantitatively examine PTM reuse by identifying patterns and qualitatively investigate how developers integrate and manage these models in practice.
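
The abstract gives the sample size (401) and the population size (28,575) but not the confidence level or margin of error behind the sampling. As a rough illustration only, the sketch below applies the standard sample-size formula for a proportion with a finite population correction. The 95% confidence level (z = 1.96), the worst-case proportion p = 0.5, and the 5% target margin are assumptions made here, not values taken from the paper, and the function names are hypothetical.

```python
import math


def margin_of_error(n, population, z=1.96, p=0.5):
    """Margin of error for a proportion estimated from a simple random
    sample of size n, with finite population correction.

    z=1.96 (95% confidence) and p=0.5 (worst case) are assumed defaults.
    """
    standard_error = math.sqrt(p * (1 - p) / n)
    correction = math.sqrt((population - n) / (population - 1))
    return z * standard_error * correction


def required_sample_size(population, margin=0.05, z=1.96, p=0.5):
    """Smallest sample size that reaches the target margin of error
    under the same assumptions."""
    n0 = (z ** 2) * p * (1 - p) / margin ** 2    # infinite-population size
    return math.ceil(n0 / (1 + (n0 - 1) / population))


POPULATION = 28_575   # repositories in PeaTMOSS, from the abstract
SAMPLE = 401          # repositories analyzed, from the abstract

moe = margin_of_error(SAMPLE, POPULATION)
needed = required_sample_size(POPULATION)
print(f"margin of error at n={SAMPLE}: {moe:.4f}")
print(f"sample size needed for a 5% margin: {needed}")
```

Under these assumed parameters, a sample of 401 gives a margin of error just under five percentage points, which is consistent with the abstract's claim but does not confirm the authors' actual sampling parameters.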

Paper Structure

This paper contains 54 sections, 13 figures, 8 tables.

Figures (13)

  • Figure 1: Example of PTM reuse in the Hugging Face ecosystem. This code snippet loads a multilingual BERT model and tokenizer using the transformers library. While reuse appears straightforward at the API level, the underlying model introduces complex dependencies, including learned behavior, format constraints, and resource requirements, that are not immediately visible in code. Such Software Dependencies 2.0 represent a distinct and underexplored dimension of software reuse. Beyond dependencies, PTM reuse also involves broader aspects of integration and interaction between models, highlighting the complexity of real-world usage.
  • Figure 2: Overview of PTM reuse pipelines and their interactions within a project. PTM Reuse Pipeline A reuses models from registries such as Hugging Face and PyTorch. It may interact with other pipelines in two ways: PTM/PTM Interaction (①): interaction with PTM Reuse Pipeline B. PTM/Scratch-Trained Interaction (②): interaction with Conventional ML Pipeline. Dashed arrows indicate optional steps. Ellipses show omitted intermediate stages.
  • Figure 3: Overview of conventional ML pipelines. Stages such as data processing and post-processing are treated at a coarse level, as their fine-grained substeps (e.g., data acquisition, data preparation) vary widely and are often implicit or domain-specific.
  • Figure 4: Data Preparation: Projects were filtered and analyzed to investigate PTM usage, pipeline stages, and model interactions.
  • Figure 5: Number of unique PTMs by modality in the target 401 projects.
  • ...and 8 more figures
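
The Figure 1 caption describes a snippet that loads a multilingual BERT model and tokenizer with the transformers library. The figure is not reproduced on this page, so the following is a minimal sketch of that kind of reuse rather than the authors' code. The checkpoint name `bert-base-multilingual-cased` is an assumption, and the `PTMDependency` record with its `revision` field is a hypothetical helper added here to make the otherwise implicit dependency visible.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PTMDependency:
    """Hypothetical record of a PTM dependency (not from the paper)."""

    repo_id: str             # model identifier on the Hugging Face Hub
    revision: str = "main"   # branch, tag, or commit hash to pin

    def load(self):
        """Download (or read from the local cache) the tokenizer and model."""
        # Imported lazily so the record can be inspected without
        # transformers installed.
        from transformers import AutoModel, AutoTokenizer

        tokenizer = AutoTokenizer.from_pretrained(
            self.repo_id, revision=self.revision
        )
        model = AutoModel.from_pretrained(self.repo_id, revision=self.revision)
        return tokenizer, model


# Assumed checkpoint; the caption only says "a multilingual BERT model".
MBERT = PTMDependency(repo_id="bert-base-multilingual-cased")
```

Calling `MBERT.load()` fetches the weights on first use and requires the transformers package plus a backend such as PyTorch. The two `from_pretrained` calls are all that appears at the API level; the learned behavior, file format, and resource requirements the caption mentions stay hidden behind the identifier, which is why recording the `revision` explicitly can be useful.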