Cross-ecosystem categorization: A manual-curation protocol for the categorization of Java Maven libraries along Python PyPI Topics

Ranindya Paramitha, Yuan Feng, Fabio Massacci, Carlos E. Budde

TL;DR

Cross-ecosystem studies require comparable functional labels for libraries across ecosystems, yet ecosystem-specific taxonomies hinder fair comparisons. The authors present a human-guided protocol to map libraries from any ecosystem to PyPI Topic categories, providing a language-agnostic functional fingerprint with explicit guidelines and roles. They demonstrate the approach on 256 Java/Maven libraries with high or critical CVEs, producing open data and artefacts to enable replication and to ground future ML/NLP scaling. The arbitration and potential class-revision steps offer a practical path to reconcile functional categories with coarse-grained exposure classifications, supporting reliable, cross-ecosystem empirical software-security research.

Abstract

Context: Software in different functional categories, such as text processing and networking, has different profiles for metrics like security and updates. Comparing, e.g., Java and Python libraries by popularity can give a skewed perspective, because the categories of the most popular software vary from one ecosystem to the next. How can one compare library datasets across software ecosystems when not even the category names are uniform among them?

Objective: We study how to generate a language-agnostic categorisation of software by functional purpose that enables cross-ecosystem studies of library datasets. This provides the functional fingerprint needed for software metrics comparisons.

Method: We designed and implemented a human-guided protocol to categorise libraries from software ecosystems. Category names mirror PyPI Topic classifiers, but the protocol is generic and can be applied to any ecosystem. We demonstrate it by categorising 256 Java/Maven libraries with severe security vulnerabilities.

Results: The protocol allows three or more people to categorise any number of libraries. The resulting categorisation is function-oriented and language-agnostic. In the Java/Maven demonstration, the majority of libraries are Internet-oriented, which is consistent with the dataset's selection by severe vulnerabilities. To allow replication and updates, we make the dataset and the individual protocol steps available as open data.

Conclusions: Categorising libraries by functional purpose is feasible with our protocol, which produced the fingerprint of a 256-library Java dataset. Although this was labour-intensive, humans excel at the required inference tasks, so full automation of the process is not envisioned. However, the results can provide the ground truth needed for machine learning in large-scale cross-ecosystem empirical studies.
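To make the notion of a functional fingerprint concrete, the following is a minimal sketch and not the authors' protocol. It assumes that each library receives one topic label from each of three assessors, that a strict majority settles the label, and that any library without a majority is set aside for arbitration. The library identifiers, the votes, and the function names `resolve` and `fingerprint` are hypothetical; the topic strings are only illustrative PyPI Topic names.

```python
from collections import Counter


def resolve(votes):
    """Return the strict-majority label among the votes, or None if there is none."""
    (top, n), = Counter(votes).most_common(1)
    return top if n > len(votes) / 2 else None


def fingerprint(labels):
    """Return the share of libraries per topic, ordered by decreasing share."""
    counts = Counter(labels.values())
    total = sum(counts.values())
    return {topic: n / total for topic, n in counts.most_common()}


# Hypothetical votes from three assessors (illustrative data, not from the paper).
votes = {
    "lib-a": ["Internet", "Internet", "Security"],
    "lib-b": ["Text Processing", "Text Processing", "Text Processing"],
    "lib-c": ["Internet", "Security", "Software Development"],
    "lib-d": ["Internet", "Internet", "Internet"],
}

agreed = {lib: resolve(v) for lib, v in votes.items()}
# Libraries without a majority label would go to an arbitration step.
pending = [lib for lib, label in agreed.items() if label is None]
settled = {lib: label for lib, label in agreed.items() if label is not None}

fp = fingerprint(settled)
print(pending)
print(fp)
```

Under these assumptions, the fingerprint is simply the normalised distribution of settled labels over topics, which is the kind of per-ecosystem distribution that can be compared across Java/Maven and Python/PyPI.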

Paper Structure (19 sections, 1 equation, 2 figures, 5 tables)

Figures (2)

  • Figure 1: Distribution of libraries along PyPI Topics in two ecosystems: Java/Maven and Python/PyPI
  • Figure 2: Partition of categorised libraries into classes according to their Internet orientation (see \ref{sec:protocol:assess:class})