Scaling Reproducibility: An AI-Assisted Workflow for Large-Scale Reanalysis

Yiqing Xu, Leo Yang Yang

TL;DR

We develop and evaluate an agentic AI workflow that addresses the execution bottleneck in large-scale reanalysis while preserving scientific rigor. The workflow substantially lowers the cost of executing established empirical protocols and can be adapted to empirical settings where analytic templates and norms of transparency are well established.

Abstract

Reproducibility is central to research credibility, yet large-scale reanalysis of empirical data remains costly because replication packages vary widely in structure, software environment, and documentation. We develop and evaluate an agentic AI workflow that addresses this execution bottleneck while preserving scientific rigor. The system separates scientific reasoning from computational execution: researchers design fixed diagnostic templates, and the workflow automates the acquisition, harmonization, and execution of replication materials using pre-specified, version-controlled code. A structured knowledge layer records resolved failure patterns, enabling adaptation across heterogeneous studies while keeping each pipeline version transparent and stable. We evaluate this workflow on 92 instrumental variable (IV) studies, including 67 with manually verified reproducible two-stage least squares (2SLS) estimates and 25 newly published IV studies under identical criteria. For each paper, we analyze up to three 2SLS specifications, totaling 215. Across the 92 papers, the system achieves 87% end-to-end success overall. Conditional on accessible data and code, reproducibility is 100% at both the paper and specification levels. The framework substantially lowers the cost of executing established empirical protocols and can be adapted to empirical settings where analytic templates and norms of transparency are well established.
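The abstract does not say how the knowledge layer is stored or queried. As a minimal sketch, assuming that a resolved failure can be represented as a regular expression over an error message plus a recorded remedy, the idea of a versioned store of failure patterns might look like the following. All names here (`FailurePattern`, `KnowledgeLayer`, `record`, `lookup`) are hypothetical and are not taken from the paper's code.

```python
import re
from dataclasses import dataclass, field


@dataclass(frozen=True)
class FailurePattern:
    """One resolved failure: a regex over error text plus a recorded remedy."""
    name: str
    error_regex: str
    fix: str  # human-readable remedy that rule-based code would apply


@dataclass
class KnowledgeLayer:
    """Versioned store of resolved failure patterns (hypothetical design)."""
    version: int = 1
    patterns: list = field(default_factory=list)

    def record(self, pattern: FailurePattern) -> "KnowledgeLayer":
        # Return a new version rather than mutating in place, so that
        # every earlier pipeline version stays stable and auditable.
        return KnowledgeLayer(self.version + 1, self.patterns + [pattern])

    def lookup(self, error_text: str):
        # Return the first pattern whose regex matches the error, else None.
        for p in self.patterns:
            if re.search(p.error_regex, error_text):
                return p
        return None


# Illustrative use: record one resolved failure, then match a new error.
kb_v1 = KnowledgeLayer()
kb_v2 = kb_v1.record(
    FailurePattern(
        name="missing_r_package",
        error_regex=r"there is no package called",
        fix="install the named package, then rerun the stage",
    )
)
match = kb_v2.lookup("Error in library(AER) : there is no package called 'AER'")
```

Returning a new object from `record` is one way to reflect the abstract's requirement that the system adapts across studies while each pipeline version remains fixed; other storage designs would serve equally well.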

Paper Structure (37 sections, 1 equation, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Overview of the agentic AI workflow for reproducibility. The figure illustrates the three-layer agentic architecture enabled by Skills. A top-layer LLM orchestrator routes tasks and interprets errors but does not perform estimation. The middle layer defines structured input–output contracts and records resolved failure patterns. The bottom layer consists of rule-based agent code and diagnostic scripts in R, Stata, and Python that execute all file and statistical operations through a modular seven-stage pipeline, from material acquisition to standardized reports.
  • Figure 2: Executive summary
  • Figure 3: Coefficient plot for Spec 1
  • Figure 5: Relationship between OLS and 2SLS estimates. This figure replicates Figure 5 in Lal et al. (2024). Panel (a) rescales both coefficients by the reported OLS standard errors; the shaded region corresponds to the interval $[-1.96, 1.96]$. Panel (b) presents the distribution of the log absolute ratio between the reported 2SLS and OLS coefficients. Panels (c) and (d) examine how first-stage strength, measured by $|\hat{\rho}(d,\hat{d})|$, relates to the magnitude of the 2SLS-to-OLS ratio. Gray markers denote observational designs and red markers denote experiment-based instruments. Panel (d) further distinguishes designs in which the OLS estimate is statistically significant at the 5% level and is presented as part of the paper’s primary results.
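The quantities described in the Figure 5 caption can be illustrated on simulated data. The sketch below is not the paper's diagnostic code: it assumes a single instrument, a single treatment, no covariates, and classical (homoskedastic) OLS standard errors, and the data-generating process is invented purely for illustration. It computes the OLS and 2SLS coefficients, both rescaled by the OLS standard error (panel a), the log absolute 2SLS-to-OLS ratio (panel b), and the first-stage strength $|\hat{\rho}(d,\hat{d})|$ (panels c and d).

```python
import math
import random
import statistics as st

# Simulated data (an assumption for illustration; true effect of d on y is 1).
random.seed(0)
n = 5000
z = [random.gauss(0, 1) for _ in range(n)]   # instrument
u = [random.gauss(0, 1) for _ in range(n)]   # unobserved confounder
d = [0.5 * zi + ui + random.gauss(0, 1) for zi, ui in zip(z, u)]  # treatment
y = [di + 2.0 * ui + random.gauss(0, 1) for di, ui in zip(d, u)]  # outcome


def cov(a, b):
    """Sample covariance of two equal-length sequences."""
    ma, mb = st.fmean(a), st.fmean(b)
    return sum((ai - ma) * (bi - mb) for ai, bi in zip(a, b)) / (len(a) - 1)


def simple_ols(x, y):
    """Intercept, slope, and classical slope SE from regressing y on x."""
    slope = cov(x, y) / cov(x, x)
    intercept = st.fmean(y) - slope * st.fmean(x)
    rss = sum((yi - intercept - slope * xi) ** 2 for xi, yi in zip(x, y))
    sxx = cov(x, x) * (len(x) - 1)
    se = math.sqrt(rss / (len(x) - 2) / sxx)
    return intercept, slope, se


# OLS of the outcome on the treatment.
_, b_ols, se_ols = simple_ols(d, y)

# First stage: regress the treatment on the instrument, form fitted values.
a_fs, pi_fs, _ = simple_ols(z, d)
d_hat = [a_fs + pi_fs * zi for zi in z]

# Second stage: point estimate only. The standard error from this manual
# second stage is not a valid 2SLS standard error, so it is discarded.
_, b_2sls, _ = simple_ols(d_hat, y)

# First-stage strength: absolute correlation between d and its fitted values.
rho = abs(cov(d, d_hat) / math.sqrt(cov(d, d) * cov(d_hat, d_hat)))

# Panel (a): both coefficients rescaled by the OLS standard error.
ols_scaled = b_ols / se_ols
tsls_scaled = b_2sls / se_ols

# Panel (b): log absolute ratio of the 2SLS to the OLS coefficient.
log_abs_ratio = math.log(abs(b_2sls / b_ols))
```

In this simulation the confounder pushes the OLS estimate above the true effect, so the log ratio is negative; the sign and size of the ratio in real studies depend on the design and need not resemble this example.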