CC30k: A Citation Contexts Dataset for Reproducibility-Oriented Sentiment Analysis

Rochana R. Obadage, Sarah M. Rajtmajer, Jian Wu

TL;DR

The paper addresses the prediction of reproducibility-oriented sentiment in scholarly citation contexts by introducing CC30k, a dataset of 30,734 citation contexts labeled as Positive, Negative, or Neutral. The dataset is built with a multi-stage pipeline that collects contexts around reproducibility studies, cleanses them, labels them through crowdsourcing, and augments the scarce negative class, reaching a labeling accuracy of about 94%. The authors demonstrate that off-the-shelf sentiment tools underperform on this domain but that fine-tuning LLMs on CC30k yields consistent improvements, with GPT-4o in a retrieval-augmented setting reaching the highest reported gains. CC30k enables scalable reproducibility assessments in AI research and is publicly available to support future work in bibliometrics and reproducibility-aware NLP.

Abstract

Sentiments about the reproducibility of cited papers in downstream literature offer community perspectives and have shown promise as a signal of the actual reproducibility of published findings. To train models that effectively predict reproducibility-oriented sentiments and to systematically study their correlation with reproducibility, we introduce the CC30k dataset, comprising a total of 30,734 citation contexts in machine learning papers. Each citation context is labeled with one of three reproducibility-oriented sentiment labels: Positive, Negative, or Neutral, reflecting the cited paper's perceived reproducibility or replicability. Of these, 25,829 are labeled through crowdsourcing, supplemented with negatives generated through a controlled pipeline to counter the scarcity of negative labels. Unlike traditional sentiment analysis datasets, CC30k focuses on reproducibility-oriented sentiments, addressing a research gap in resources for computational reproducibility studies. The dataset was created through a pipeline that includes robust data cleansing, careful crowd selection, and thorough validation. The resulting dataset achieves a labeling accuracy of 94%. We then demonstrate that the performance of three large language models improves significantly on reproducibility-oriented sentiment classification after fine-tuning on our dataset. The dataset lays the foundation for large-scale assessments of the reproducibility of machine learning papers. The CC30k dataset and the Jupyter notebooks used to produce and analyze the dataset are publicly available at https://github.com/lamps-lab/CC30k.
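
To make the numbers in the abstract concrete, the following is a minimal Python sketch of a three-label citation-context record, the split between crowdsourced and non-crowdsourced contexts, and a labeling-accuracy check. It is an illustration under stated assumptions, not the authors' code: the `CitationContext` class, its `source` field, and the helper functions are hypothetical names, and the released files in the repository may use a different schema. Only the totals (30,734 and 25,829) and the three label names come from the abstract.

```python
from collections import Counter
from dataclasses import dataclass

# The three reproducibility-oriented sentiment labels named in the abstract.
LABELS = ("Positive", "Negative", "Neutral")

# Counts stated in the abstract.
TOTAL_CONTEXTS = 30_734
CROWDSOURCED_CONTEXTS = 25_829
# Remainder, which the abstract attributes to the negative-augmentation pipeline.
NON_CROWDSOURCED_CONTEXTS = TOTAL_CONTEXTS - CROWDSOURCED_CONTEXTS


@dataclass(frozen=True)
class CitationContext:
    """Hypothetical record layout; field names are assumptions."""
    text: str
    label: str   # one of LABELS
    source: str  # e.g. "crowdsourced" or "augmented" (assumed values)


def label_distribution(contexts):
    """Count how many contexts carry each of the three labels."""
    counts = Counter(c.label for c in contexts)
    return {label: counts.get(label, 0) for label in LABELS}


def labeling_accuracy(assigned, reference):
    """Fraction of assigned labels that agree with reference labels."""
    if len(assigned) != len(reference):
        raise ValueError("assigned and reference must have the same length")
    if not assigned:
        raise ValueError("cannot compute accuracy on an empty sample")
    matches = sum(a == r for a, r in zip(assigned, reference))
    return matches / len(assigned)


if __name__ == "__main__":
    # Toy records for illustration only; these are not taken from CC30k.
    sample = [
        CitationContext("We reproduced the reported results.", "Positive", "crowdsourced"),
        CitationContext("We could not replicate their numbers.", "Negative", "augmented"),
        CitationContext("We follow the setup of prior work.", "Neutral", "crowdsourced"),
    ]
    print(NON_CROWDSOURCED_CONTEXTS)
    print(label_distribution(sample))
    print(labeling_accuracy(["Positive", "Neutral"], ["Positive", "Negative"]))
```

A reported accuracy of 94% would correspond to `labeling_accuracy` returning 0.94 on a validated sample, although the abstract does not state how that sample was drawn.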

Paper Structure

This paper contains 20 sections, 7 figures, 9 tables.

Figures (7)

  • Figure 1: Examples of citation contexts with different reproducibility-oriented sentiments [acm-rep-24].
  • Figure 2: The data production process.
  • Figure 3: Crowdsourcing task interface using Crowd HTML elements.
  • Figure 4: The pipeline of augmenting negative citation contexts; AML: augmented and machine labeled, AHV: augmented and human-validated.
  • Figure 5: Distribution of citation contexts, citing papers, and sentiment proportions across cited papers (crowdsourced portion): (a) Citation context count distribution across cited papers, (b) Citing paper count distribution across cited papers, (c) Distribution of the fraction of positive and negative citation contexts among all citation contexts in a paper.
  • ...and 2 more figures