A multi-level analysis of data quality for formal software citation

David Schindler; Tazin Hossain; Sascha Spors; Frank Krüger

TL;DR

This work assesses the data quality of formal software citations across the entire data lifecycle by manually annotating a high-quality SoMeSci-derived corpus. It analyzes the types of resources cited, traces formal citations that occur without in-text mentions, measures the completeness of Direct Citations, and evaluates how well publishers and bibliographic databases (Semantic Scholar and Crossref) represent these citations, using detailed metadata and alluvial-plot visualizations. Key findings show that software articles are the most common citation target. While direct software citations can uniquely identify software and its code base, substantial gaps in metadata and database representation limit large-scale software impact analyses. The study highlights the need for better, machine-readable modeling of software in bibliographic infrastructures, suggests dual-citation practices to ensure reproducibility and credit, and urges data providers to adapt to the specifics of software citation.

Abstract

Software is a central part of modern science, and knowledge of its use is crucial for the scientific community with respect to reproducibility and attribution to its developers. Several studies have investigated in-text mentions of software and their quality, while the quality of formal software citations has only been analyzed superficially. This study performs an in-depth evaluation of formal software citation based on a set of manually annotated software references. It examines which resources are cited for software usage, to what extent they allow proper identification of software and its specific version, how this information is made available by scientific publishers, and how well it is represented in large-scale bibliographic databases. The results show that software articles are the most cited resource for software, while direct software citations are better suited for identifying software versions. Moreover, we found current practices by both publishers and bibliographic databases to be unsuited to representing these direct software citations, hindering large-scale analyses such as assessing software impact. We argue that current practices for representing software citations, the recommended way to cite software under current citation standards, stand in the way of their adoption by the scientific community, and urge providers of bibliographic data to explicitly model scientific software.

Paper Structure (34 sections, 16 figures, 1 table)

Introduction
Related Work
Analyses
Citations Resource Types
Software Articles
Software Manuals
Websites
Other
Formal Software Reference without in-text Software Mentions
Direct Citation Completeness
Database Accuracy
Confidence Intervals (CI)
Dataset
Annotation
Software Citation Types
...and 19 more sections

Figures (16)

Figure 1: SoMeSci annotation example in which the software R is mentioned in-text and the citation "[30]" is associated with the mention. Example from SoMeSci.
Figure 2: Reference information for the SoMeSci annotation #30 of article PMC5690316 shown in Figure 1. The reference is an example of a Software Article.
Figure 3: Flowchart illustrating the annotation and analysis steps used to investigate the research questions outlined in the Analyses section.
Figure 4: Semantic Scholar reference entry corresponding to the JATS entry in Listing \ref{list:software_direct_xml} [ID: PMC2134966, Semantic Scholar ID: 1116831, del2007modular]. Individual meta-data fields are highlighted, and meta-data for which no label is provided is marked. Some information from the original publisher entry is missing.
Figure 5: Crossref reference entry corresponding to the JATS entry in Listing \ref{list:software_direct_xml} [ID: PMC2134966, del2007modular]. Individual meta-data fields are highlighted and marked.
...and 11 more figures
