Can citations tell us about a paper's reproducibility? A case study of machine learning papers

Rochana R. Obadage; Sarah M. Rajtmajer; Jian Wu

TL;DR

The paper investigates whether the sentiment of downstream citation contexts can signal the reproducibility of ML/AI papers. It develops an aspect-based sentiment analysis framework using two classifiers, including a DistilBERT-based model and a hierarchical model, trained on a ground-truth set of reproducibility-related citation contexts and evaluated against baselines. The study defines an extended reproducibility score, $rs\_score$, and normalizes citation-context sentiment counts to explore their correlation with it, reporting that higher reproducibility scores align with more positive and fewer negative citation-context sentiments. If validated on larger datasets, this approach could enable scalable surrogate assessments of reproducibility trends across the ML literature when direct replication is impractical.

Abstract

The iterative character of work in machine learning (ML) and artificial intelligence (AI) and reliance on comparisons against benchmark datasets emphasize the importance of reproducibility in that literature. Yet, resource constraints and inadequate documentation can make running replications particularly challenging. Our work explores the potential of using downstream citation contexts as a signal of reproducibility. We introduce a sentiment analysis framework applied to citation contexts from papers involved in Machine Learning Reproducibility Challenges in order to interpret the positive or negative outcomes of reproduction attempts. Our contributions include training classifiers for reproducibility-related contexts and sentiment analysis, and exploring correlations between citation context sentiment and reproducibility scores. Study data, software, and an artifact appendix are publicly available at https://github.com/lamps-lab/ccair-ai-reproducibility.
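
The correlation step mentioned above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the summary does not say how sentiment counts are normalized or which correlation measure is used, so the sketch divides each sentiment count by the paper's total number of labeled contexts and uses a Pearson coefficient. The function names, paper identifiers, scores, and labels are hypothetical toy values, not data from the study.

```python
from collections import Counter
from math import sqrt

SENTIMENTS = ("positive", "negative", "neutral")


def normalized_sentiment_counts(labels):
    """Return each sentiment's share of one paper's labeled citation contexts.

    Assumption: normalization divides by the total number of contexts;
    the source does not specify the scheme it uses.
    """
    counts = Counter(labels)
    total = sum(counts[s] for s in SENTIMENTS)
    if total == 0:
        return {s: 0.0 for s in SENTIMENTS}
    return {s: counts[s] / total for s in SENTIMENTS}


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy data: (reproducibility score, sentiment labels of citing contexts).
papers = {
    "paper_a": (0.9, ["positive", "positive", "neutral", "positive"]),
    "paper_b": (0.5, ["positive", "negative", "neutral", "neutral"]),
    "paper_c": (0.2, ["negative", "negative", "neutral", "positive"]),
}

scores = [score for score, _ in papers.values()]
shares = [normalized_sentiment_counts(labels) for _, labels in papers.values()]
positive_share = [s["positive"] for s in shares]
negative_share = [s["negative"] for s in shares]

r_positive = pearson(scores, positive_share)
r_negative = pearson(scores, negative_share)
print(f"score vs. positive share: {r_positive:+.2f}")
print(f"score vs. negative share: {r_negative:+.2f}")
```

In this toy setup the positive share rises and the negative share falls with the score, which mirrors the direction of the reported trend; with only three made-up papers the coefficients carry no evidential weight.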

Paper Structure (12 sections, 2 equations, 4 figures, 4 tables)

Introduction
Related Work
Dataset
Reproducibility Studies
Reproducibility Score Calculation
Citation Context Collection
Building the Ground Truth
Sentiment analysis
Results
Sentiment Analysis
Citation Context Sentiments vs. Reproducibility Scores
Discussion and Conclusion

Figures (4)

Figure 1: Examples of citation context with different reproducibility sentiments.
Figure 2: A schematic illustration of the data reduction and processing workflow.
Figure 3: Normalized citation context sentiment counts vs. reproducibility scores using M6 (left) and M7 (right).
Figure 4: Normalized positive and negative citation context counts vs. rs_scores using M6 (left) and M7 (right).
