PaperRepro: Automated Computational Reproducibility Assessment for Social Science Papers

Linhao Zhang; Tong Xia; Jinghua Piao; Lizhen Cui; Yong Li

TL;DR

This work proposes PaperRepro, a two-stage, multi-agent approach to automated reproducibility assessment that separates execution from evaluation and uses the LM's coding capability to capture reproduced results more completely for evaluation.

Abstract

Computational reproducibility is essential for the credibility of scientific findings, particularly in the social sciences, where findings often inform real-world decisions. Manual reproducibility assessment is costly and time-consuming, as it is nontrivial to reproduce the reported findings using the authors' released code and data. Recent advances in large models (LMs) have inspired agent-based approaches for automated reproducibility assessment. However, existing approaches often struggle due to limited context capacity, inadequate task-specific tooling, and insufficient result capture. To address these limitations, we propose PaperRepro, a novel two-stage, multi-agent approach that separates execution from evaluation. In the execution stage, agents execute the reproduction package and edit the code to capture reproduced results as explicit artifacts. In the evaluation stage, agents evaluate reproducibility using this explicit evidence. PaperRepro assigns distinct responsibilities to agents and equips them with task-specific tools and expert prompts, mitigating context and tooling limitations. It further maximizes the LM's coding capability to enable more complete result capture for evaluation. On REPRO-Bench, a social science reproducibility assessment benchmark, PaperRepro achieves the best overall performance, with a 21.9% relative improvement in score-agreement accuracy over the strongest prior baseline. We further refine the benchmark and introduce REPRO-Bench-S, a benchmark stratified by execution difficulty for more diagnostic evaluation of automated reproducibility assessment systems. Our code and data are publicly available.
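The headline result is a relative improvement, not an absolute one, so it may help to see the arithmetic. The following is a minimal sketch, not the benchmark's official scoring code. It assumes score-agreement accuracy is the fraction of papers whose predicted score (1 to 4) exactly matches the reference score, and that the executability view merges scores 2 to 4 into one class, as the Figure 4 caption suggests. All function names and the toy scores are hypothetical.

```python
from typing import Sequence


def score_agreement_accuracy(pred: Sequence[int], true: Sequence[int]) -> float:
    """Fraction of papers where the predicted score equals the reference score.

    Assumption: exact match on the 1-4 scale; the benchmark may define it differently.
    """
    if len(pred) != len(true) or not true:
        raise ValueError("pred and true must be non-empty and of equal length")
    return sum(p == t for p, t in zip(pred, true)) / len(true)


def to_executable(score: int) -> bool:
    """Collapse the 1-4 scale to a binary label.

    Assumption: score 1 is one class and scores 2-4 are merged into the other.
    """
    return score >= 2


def executability_accuracy(pred: Sequence[int], true: Sequence[int]) -> float:
    """Accuracy after merging scores 2-4 into a single class."""
    return score_agreement_accuracy(
        [int(to_executable(p)) for p in pred],
        [int(to_executable(t)) for t in true],
    )


def relative_improvement(new: float, baseline: float) -> float:
    """Relative gain of `new` over `baseline`, e.g. 0.25 means 25% better."""
    if baseline <= 0:
        raise ValueError("baseline must be positive")
    return (new - baseline) / baseline


# Toy scores for illustration only; these are not results from the paper.
toy_pred = [1, 2, 3, 4]
toy_true = [1, 2, 4, 4]
toy_exact = score_agreement_accuracy(toy_pred, toy_true)
toy_exec = executability_accuracy(toy_pred, toy_true)
```

In this toy case the third paper counts as a miss on the 1 to 4 scale but as a hit once scores 2 to 4 are merged, which is why the two views can diverge.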

Paper Structure (37 sections, 5 equations, 11 figures, 7 tables)


Introduction
Background
Problem definition.
Practical challenges for automation.
PaperRepro
Artifact-driven Execution Stage
Evidence-grounded Evaluation Stage
Key Design
Experiments
Experimental Setup
Benchmark.
Our approach and Baselines.
Metrics.
Experimental Results and Analysis
Main Results.
...and 22 more sections

Figures (11)

Figure 1: Pipeline of computational reproducibility assessment. A challenging process of executing the reproduction package and verifying the reproduced results against the paper’s reported results.
Figure 2: Three representative reproduction scenarios that commonly hinder automated reproducibility assessment in social science.
Figure 3: Overview of PaperRepro. It follows a two-stage pipeline: (a) an artifact-driven execution stage that executes the reproduction package and captures reproduced artifacts, and (b) an evidence-grounded evaluation stage that aligns artifacts with the paper to produce the final score and report.
Figure 4: Score-level confusion heatmaps for PaperRepro. Left: normalized distribution over scores 1–4. Right: scores 2–4 merged, consistent with the executability metric.
Figure 5: Distribution of failure types for cases where PaperRepro predicts score 1, but the true score is in 2–4.
...and 6 more figures
