Biomedical Open Source Software: Crucial Packages and Hidden Heroes

Eva Maxfield Brown, Stephan Druskat, Laurent Hébert-Dufresne, James Howison, Daniel Mietchen, Andrew Nesbitt, João Felipe Pimentel, Boris Veytsman

TL;DR

This study tackles the under-recognition of foundational software in biomedical research by building a two-mode network that connects papers and open-source software across PyPI, CRAN, and Bioconductor using CZ Software Mentions and related datasets. It adopts Katz centrality with $\beta=1$ to quantify how software packages gain centrality through direct mentions and downstream dependencies, analyzing unweighted, weighted, and largest-connected-component variants. The results reveal a dense core of widely used packages, a minority of Nebraska-like central packages, and ecosystem-specific patterns, highlighting critical but under-acknowledged dependencies. The work provides actionable metrics to identify high-impact yet low-visibility software for targeted funding and maintenance, and outlines methodological enhancements like temporal dynamics and SBOM-informed analyses for more robust future insights.

Abstract

Despite the importance of scientific software for research, it is often not formally recognized and rewarded. This is especially true for foundational libraries, which are hidden below packages visible to the users (and thus doubly hidden, since even the packages directly used in research are frequently not visible in the paper). Research stakeholders like funders, infrastructure providers, and other organizations need to understand the complex network of computer programs that contemporary research relies upon. In this work, we use the CZ Software Mentions Dataset to map the upstream dependencies of software used in biomedical papers and find the packages critical to scientific software ecosystems. We propose centrality metrics for the network of software dependencies, analyze three ecosystems (PyPI, CRAN, Bioconductor), and determine the packages with the highest centrality.

Paper Structure (7 sections, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Classification of software packages inspired by Stokes' classification system [Stokes1997:PasteursQuadrant]. “Nebraska” packages are software projects that have few mentions in research articles but are highly central in a dependency network. “Pasteur” packages are both highly visible, with many mentions, and highly central in a dependency network.
  • Figure 2: (a) Network visualization of software packages from three ecosystems (CRAN in green, PyPI in blue, and Bioconductor in pink), connected through their dependencies within each ecosystem and interconnected through the papers that mention them. We label the three most central packages in each ecosystem: ggplot2 [wickham2011ggplot2], SAM [ravikumar2009SAM], and PRISMA [krueger2012prisma] for CRAN; velvet [velvet], tophat, and pymol [delano2002pymol] for PyPI; and DESeq2 [love2014deseq2], edgeR [robinson2010edger], and limma [smyth2005limma, ritchie2015limma] for Bioconductor. The core of the network is dominated by CRAN and PyPI dependencies, even though three of the five most central packages come from Bioconductor. (b) The top part of the above network, with papers added (in grey) to illustrate how PRISMA [krueger2012prisma] can be central due to many mentions in papers.
  • Figure 3: Distribution of packages by Katz centrality and counts of their mentions in papers. Katz centrality is calculated for an unweighted graph, for a weighted graph with all nodes, or just for the largest connected cluster (LCC) for each ecosystem. In the calculations, we assumed $\beta=1$.
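
The Katz centrality referred to in the Figure 3 caption can be illustrated on a toy graph. The following is a minimal sketch, not the authors' implementation: the paper and package names are hypothetical, and the attenuation factor `alpha = 0.5`, the unit edge weights, and the edge direction (paper to mentioned package, package to its dependency) are assumptions made for illustration. Only $\beta=1$ is taken from the text.

```python
from collections import defaultdict


def katz_centrality(edges, alpha=0.5, beta=1.0, max_iter=1000, tol=1e-10):
    """Iteratively solve x_j = beta + alpha * sum_i w_ij * x_i.

    edges: iterable of (source, target, weight) tuples. Centrality flows
    from source to target, so a dependency inherits score from every
    paper and package that points at it.
    alpha is an assumed attenuation factor; beta=1 follows the text.
    """
    incoming = defaultdict(list)
    nodes = set()
    for src, dst, weight in edges:
        incoming[dst].append((src, weight))
        nodes.update((src, dst))

    x = {node: beta for node in nodes}
    for _ in range(max_iter):
        new_x = {
            node: beta + alpha * sum(w * x[src] for src, w in incoming[node])
            for node in nodes
        }
        if max(abs(new_x[n] - x[n]) for n in nodes) < tol:
            return new_x
        x = new_x
    raise RuntimeError("Katz iteration did not converge; try a smaller alpha")


# Hypothetical toy network: three papers mention "plotlib", one paper
# mentions "leaflib", and "plotlib" depends on "corelib", which no paper
# mentions directly.
edges = [
    ("paper1", "plotlib", 1.0),
    ("paper2", "plotlib", 1.0),
    ("paper3", "plotlib", 1.0),
    ("paper4", "leaflib", 1.0),
    ("plotlib", "corelib", 1.0),
]

scores = katz_centrality(edges)
for name in ("plotlib", "corelib", "leaflib"):
    print(name, scores[name])
```

In this toy graph, `corelib` receives no direct mentions yet scores 2.25, above `leaflib` (1.5) with its single mention, because it inherits centrality from `plotlib` (2.5). This is the pattern the Figure 1 caption calls a “Nebraska” package. A weighted variant would only change the third element of each edge tuple.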