Can citations tell us about a paper's reproducibility? A case study of machine learning papers

Rochana R. Obadage, Sarah M. Rajtmajer, Jian Wu

TL;DR

The paper investigates whether the sentiment of downstream citation contexts can signal reproducibility in ML/AI papers. It develops an aspect-based sentiment framework using two classifiers, including a DistilBERT-based model and a hierarchical model, trained on a ground-truth set of reproducibility-related contexts and evaluated against baselines. The study defines an extended reproducibility score, $rs\_score$, and normalizes citation-context sentiment counts to explore their correlation with $rs\_score$, reporting that higher reproducibility scores align with more positive and fewer negative citation-context sentiments. If validated on larger datasets, this approach could enable scalable surrogate assessments of reproducibility trends across the ML literature when direct replication is impractical.
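To make the normalize-then-correlate step concrete, here is a minimal sketch in Python. It rests on assumptions that are not taken from the paper: the normalization is assumed to divide each sentiment count by the paper's total number of reproducibility-related citation contexts, the correlation is assumed to be plain Pearson, and the `papers` records, field names, and helper functions are hypothetical toy values made up for illustration, not study data.

```python
from math import sqrt


def normalize_counts(pos, neg, neu):
    """Turn raw sentiment counts into fractions of all contexts.

    Assumption: normalization divides by the total number of
    reproducibility-related citation contexts for one paper.
    """
    total = pos + neg + neu
    if total == 0:
        return (0.0, 0.0, 0.0)
    return (pos / total, neg / total, neu / total)


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy records (NOT data from the study): one dict per cited
# paper, with an rs_score and counts of positive/negative/neutral contexts.
papers = [
    {"rs_score": 0.2, "pos": 1, "neg": 6, "neu": 3},
    {"rs_score": 0.5, "pos": 3, "neg": 3, "neu": 4},
    {"rs_score": 0.9, "pos": 7, "neg": 1, "neu": 2},
]

scores = [p["rs_score"] for p in papers]
fractions = [normalize_counts(p["pos"], p["neg"], p["neu"]) for p in papers]
pos_fracs = [f[0] for f in fractions]
neg_fracs = [f[1] for f in fractions]

r_pos = pearson(scores, pos_fracs)  # rs_score vs. share of positive contexts
r_neg = pearson(scores, neg_fracs)  # rs_score vs. share of negative contexts
print(f"r(rs_score, positive share) = {r_pos:.3f}")
print(f"r(rs_score, negative share) = {r_neg:.3f}")
```

Dividing by the per-paper total keeps heavily cited papers from dominating the comparison simply because they have more citation contexts; a rank-based correlation could be substituted for `pearson` if the relationship is not expected to be linear.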

Abstract

The iterative character of work in machine learning (ML) and artificial intelligence (AI) and reliance on comparisons against benchmark datasets emphasize the importance of reproducibility in that literature. Yet, resource constraints and inadequate documentation can make running replications particularly challenging. Our work explores the potential of using downstream citation contexts as a signal of reproducibility. We introduce a sentiment analysis framework applied to citation contexts from papers involved in Machine Learning Reproducibility Challenges in order to interpret the positive or negative outcomes of reproduction attempts. Our contributions include training classifiers for reproducibility-related contexts and sentiment analysis, and exploring correlations between citation context sentiment and reproducibility scores. Study data, software, and an artifact appendix are publicly available at https://github.com/lamps-lab/ccair-ai-reproducibility.

Paper Structure (12 sections, 2 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: Examples of citation context with different reproducibility sentiments.
  • Figure 2: A schematic illustration of the data reduction and processing workflow.
  • Figure 3: Normalized citation context sentiment counts vs. reproducibility scores using M6 (left) and M7 (right).
  • Figure 4: Normalized positive and negative citation context counts vs. rs_scores using M6 (left) and M7 (right).