OpenDORS: A dataset of openly referenced open research software

Stephan Druskat, Lars Grunske

TL;DR

OpenDORS addresses the lack of large-scale empirical data on research software by constructing a reproducible dataset that links openly referenced software in literature to actual source code repositories. The authors develop a URL-component mining pipeline and accompanying open-source tools to assemble 134,352 research software projects and 134,154 repositories, including version and metadata details. The dataset enables large-scale RSE research, reproducibility studies, and cross-domain comparisons, with future work to expand sources and improve versioning and archival integration. This resource aims to catalyze empirical understanding of research software practice and lifecycle at scale.

Abstract

In many academic disciplines, software is created during the research process or for a research purpose. The crucial role of software for research is increasingly acknowledged. The application of software engineering to research software has been formalized as research software engineering, to create better software that enables better research. Despite this, large-scale studies of research software and its development are still lacking. To enable such studies, we present a dataset of 134,352 unique open research software projects and 134,154 source code repositories referenced in open access literature. Each dataset record identifies the referencing publication and lists source code repositories of the software project. For 122,425 source code repositories, the dataset provides metadata on latest versions, license information, programming languages and descriptive metadata files. We summarize the distributions of these features in the dataset and describe additional software metadata that extends the dataset in future work. Finally, we suggest examples of research that could use the dataset to develop a better understanding of research software practice in RSE research.

Paper Structure

This paper contains 7 sections, 3 figures, and 1 table.

Figures (3)

  • Figure 1: Data provenance over the dataset construction process.
  • Figure 2: References (log scale) to repositories per data source. Declines in PMC/ArXiv references reflect publications available in escamilla_extract-urls.
  • Figure 3: Repository counts for programming languages (top 20).