Don't mention it: An approach to assess challenges to using software mentions for citation and discoverability research

Stephan Druskat, Neil P. Chue Hong, Sammie Buzzard, Olexandr Konovalov, Patrick Kornek

TL;DR

An approach to assess the usability of software mention datasets for research on research software was applied to two datasets. It comprises sampling and data preparation, manual annotation for quality and mention characteristics, and annotation analysis.

Abstract

Datasets collecting software mentions from scholarly publications can potentially be used for research into the software that has been used in the published research, as well as into the practice of software citation. Recently, new software mention datasets with different characteristics have been published. We present an approach to assess the usability of such datasets for research on research software. Our approach includes sampling and data preparation, manual annotation for quality and mention characteristics, and annotation analysis. We applied it to two software mention datasets and evaluated it based on qualitative observation. In doing so, we identified challenges to using the selected datasets for research. The main issues concern the structure of the datasets, the quality of the extracted mentions (54% and 23% of mentions, respectively, do not refer to software), and software accessibility. While one dataset does not provide links to mentioned software at all, the other does so in a way that can impede quantitative research endeavors: (1) Links may come from different sources, and each may point to different software for the same mention. (2) The quality of the automatically retrieved links is generally poor (in our sample, 65.4% link to the wrong software). (3) Links exist only for a small subset of mentions (in our sample, 20.5%), which may lead to skewed or disproportionate samples. However, the greatest challenge and underlying issue in working with software mention datasets is the still suboptimal practice of software citation: Software should not be mentioned; it should be cited following the software citation principles.

Paper Structure

This paper contains 17 sections, 8 figures, and 8 tables.

Figures (8)

  • Figure 1: Distribution of mention counts over the complete exploded CSM dataset. x: distinct software mentions (log), y: sum of mentions for distinct software.
  • Figure 2: Distribution of mention counts over our sample from the CSM dataset. x: distinct software mentions (log), y: sum of mentions for distinct software.
  • Figure 3: Distribution of mention counts over the complete filtered CZI dataset. x: distinct software mentions (log), y: sum of mentions for distinct software.
  • Figure 4: Distribution of mention counts over our 100k sample from the CZI dataset. x: distinct software mentions (log), y: sum of mentions for distinct software.
  • Figure 5: Visualization of the complete assessment workflow.
  • ...and 3 more figures
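
The captions of Figures 1 to 4 describe distributions of mention counts per distinct software. A minimal sketch of how such a distribution could be computed is given below. It is not the authors' code: the function name, the input format (a flat list of mention strings), and the normalization by trimming whitespace and lowercasing are all assumptions made for illustration.

```python
from collections import Counter


def mention_count_distribution(mentions):
    """Return (rank, count) pairs, most frequently mentioned software first.

    `mentions` is a hypothetical flat list of software name strings,
    one entry per mention. Trimming and lowercasing is an assumed,
    deliberately naive way to group mentions of the same software.
    """
    counts = Counter(name.strip().lower() for name in mentions)
    # Sort the per-software totals in descending order and attach a rank,
    # which would serve as the (log-scaled) x axis of such a plot.
    ranked = sorted(counts.values(), reverse=True)
    return list(enumerate(ranked, start=1))


# Illustrative toy input, not taken from either dataset.
sample = ["R", "SPSS", "r", "ImageJ", "R ", "SPSS"]
distribution = mention_count_distribution(sample)
```

Comparing the shape of this distribution for a sample against that of the full dataset is one way to check whether the sample preserves the original skew toward a few frequently mentioned packages.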