PyRadar: Towards Automatically Retrieving and Validating Source Code Repository Information for PyPI Packages

Kai Gao, Weiwei Xu, Wenhao Yang, Minghui Zhou

TL;DR

PyRadar tackles the problem of missing or incorrect source repository links for PyPI releases by combining metadata-based retrieval, a validator built on six crafted features, and hash-based retrieval from source code via World of Code. The authors first run a large-scale empirical study that compares existing metadata tools and measures how phantom files differ between correct and incorrect links. They then design a three-component framework that retrieves repository information from metadata for 72.1% of releases, validates links with an AUC of up to 0.995, and retrieves repository information from source code for 90.2% of packages with an accuracy of 0.970. They show that the metadata-based and source code-based approaches are complementary, and that PyRadar achieves an overall accuracy of 0.88 on a curated dataset of correct and incorrect links, enabling more reliable use and monitoring of PyPI packages. A replication package is provided to facilitate adoption and further research in repository provenance for package registries.

Abstract

A package's source code repository records the development history of the package, providing indispensable information for the use and risk monitoring of the package. However, a package release often lacks a link to its source code repository because the package's development platform is separate from its distribution platform. Existing tools retrieve the release's repository information from its metadata, which suffers from two limitations: the metadata may contain no repository information, or it may contain wrong information. Our analysis shows that existing tools can only retrieve repository information for up to 70.5% of PyPI releases. To address these limitations, this paper proposes PyRadar, a novel framework that utilizes the metadata and source distribution to retrieve and validate the repository information for PyPI releases. We start with an empirical study to compare four existing tools on 4,227,425 PyPI releases and analyze phantom files (files appearing in the release's distribution but not in the release's repository) in 14,375 correct package-repository links and 2,064 incorrect links. Based on the findings, we design PyRadar with three components, i.e., Metadata-based Retriever, Source Code Repository Validator, and Source Code-based Retriever. In particular, the Metadata-based Retriever combines best practices of existing tools and successfully retrieves repository information from the metadata for 72.1% of PyPI releases. The Source Code Repository Validator applies common machine learning algorithms on six crafted features and achieves an AUC of up to 0.995. The Source Code-based Retriever queries World of Code with the SHA-1 hashes of all Python files in the release's source distribution and retrieves repository information for 90.2% of packages in our dataset with an accuracy of 0.970. Both practitioners and researchers can employ PyRadar to better use PyPI packages.
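
To make the hashing step and the notion of phantom files concrete, here is a minimal sketch, not the authors' implementation. It assumes that the SHA-1 hashes in question are Git blob hashes (SHA-1 over a `blob <size>\0` header plus the file content) and that a phantom file can be approximated as a file whose content hash does not occur in the candidate repository. The function names and the demo archive are hypothetical, and the World of Code lookup itself is left out.

```python
import hashlib
import io
import tarfile


def git_blob_sha1(content: bytes) -> str:
    """SHA-1 of a file the way Git stores it: 'blob <size>\\0' header + content."""
    header = f"blob {len(content)}\0".encode()
    return hashlib.sha1(header + content).hexdigest()


def python_file_hashes(sdist: tarfile.TarFile) -> dict[str, str]:
    """Map each .py member of an opened source distribution to its blob hash."""
    hashes = {}
    for member in sdist.getmembers():
        if member.isfile() and member.name.endswith(".py"):
            content = sdist.extractfile(member).read()
            hashes[member.name] = git_blob_sha1(content)
    return hashes


def phantom_files(release_hashes: dict[str, str], repo_hashes: set[str]) -> list[str]:
    """Assumption: a file is phantom if its hash is absent from the repository."""
    return sorted(p for p, h in release_hashes.items() if h not in repo_hashes)


def demo_sdist() -> tarfile.TarFile:
    """Build a tiny in-memory .tar.gz that stands in for a real sdist."""
    files = [
        ("pkg-1.0/pkg/__init__.py", b""),
        ("pkg-1.0/setup.py", b"hello\n"),
        ("pkg-1.0/README.md", b"doc"),
    ]
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in files:
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return tarfile.open(fileobj=buf, mode="r:gz")


if __name__ == "__main__":
    release = python_file_hashes(demo_sdist())
    # Pretend the candidate repository only contains the empty __init__.py blob.
    repo = {git_blob_sha1(b"")}
    print(phantom_files(release, repo))
```

In this sketch the set of hashes for the candidate repository is supplied by the caller; in practice it would come from whatever index maps file hashes to repositories, which is the role World of Code plays in the paper.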

Paper Structure

This paper contains 30 sections, 4 figures, 9 tables, and 2 algorithms.

Figures (4)

  • Figure 1: Overview of this study
  • Figure 2: GitHub dependency graph page of the repository https://github.com/numpy/numpy/network/dependents
  • Figure 3: Distribution of the number of phantom files and phantom Python files in correct and incorrect links.
  • Figure 4: Overview of the PyRadar framework