Biomedical Open Source Software: Crucial Packages and Hidden Heroes
Eva Maxfield Brown, Stephan Druskat, Laurent Hébert-Dufresne, James Howison, Daniel Mietchen, Andrew Nesbitt, João Felipe Pimentel, Boris Veytsman
TL;DR
This study tackles the under-recognition of foundational software in biomedical research by building a two-mode network that connects papers and open-source software across PyPI, CRAN, and Bioconductor using CZ Software Mentions and related datasets. It adopts Katz centrality with $\beta=1$ to quantify how software packages gain centrality through direct mentions and downstream dependencies, analyzing unweighted, weighted, and largest-connected-component variants. The results reveal a dense core of widely used packages, a minority of Nebraska-like central packages, and ecosystem-specific patterns, highlighting critical but under-acknowledged dependencies. The work provides actionable metrics to identify high-impact yet low-visibility software for targeted funding and maintenance, and outlines methodological enhancements, such as temporal dynamics and SBOM-informed analyses, for more robust future insights.
Abstract
Despite the importance of scientific software for research, it is often not formally recognized and rewarded. This is especially true for foundational libraries, which are hidden below packages visible to the users (and thus doubly hidden, since even the packages directly used in research are frequently not visible in the paper). Research stakeholders like funders, infrastructure providers, and other organizations need to understand the complex network of computer programs that contemporary research relies upon. In this work, we use the CZ Software Mentions Dataset to map the upstream dependencies of software used in biomedical papers and find the packages critical to scientific software ecosystems. We propose centrality metrics for the network of software dependencies, analyze three ecosystems (PyPI, CRAN, Bioconductor), and determine the packages with the highest centrality.
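To make the centrality idea concrete, the following is a minimal sketch of Katz centrality on a toy paper-and-package network. It is an illustration under stated assumptions, not the authors' implementation: the node names, the edge direction (paper to mentioned package, package to its upstream dependency), the attenuation factor `alpha = 0.5`, and the reading of $\beta=1$ as the constant baseline score given to every node are all assumed here, and the paper's exact formulation may differ.

```python
def katz_centrality(edges, alpha=0.5, beta=1.0, max_iter=1000, tol=1e-12):
    """Iteratively solve x[v] = beta + alpha * sum(x[u] for every edge u -> v).

    alpha (attenuation) is an assumed value; beta is the baseline score.
    """
    nodes = sorted({node for edge in edges for node in edge})
    incoming = {node: [] for node in nodes}
    for source, target in edges:
        incoming[target].append(source)

    scores = {node: beta for node in nodes}
    for _ in range(max_iter):
        updated = {
            node: beta + alpha * sum(scores[src] for src in incoming[node])
            for node in nodes
        }
        # Stop once no score changes by more than the tolerance.
        if max(abs(updated[n] - scores[n]) for n in nodes) < tol:
            return updated
        scores = updated
    raise RuntimeError("Katz iteration did not converge; try a smaller alpha")


# Hypothetical toy network: papers mention packages, packages depend on others.
edges = [
    ("paper_1", "pkg_app"),   # paper_1 mentions pkg_app
    ("paper_2", "pkg_app"),   # paper_2 mentions pkg_app
    ("paper_3", "pkg_plot"),  # paper_3 mentions pkg_plot
    ("pkg_app", "pkg_core"),  # pkg_app depends on pkg_core
    ("pkg_plot", "pkg_core"), # pkg_plot depends on pkg_core
]

scores = katz_centrality(edges)
for name in ("pkg_app", "pkg_plot", "pkg_core"):
    print(name, scores[name])
```

In this toy network no paper mentions `pkg_core` directly, yet it receives the highest score because both mentioned packages depend on it. That is the kind of hidden, foundational package the centrality metrics are meant to surface.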
