AIRepr: An Analyst-Inspector Framework for Evaluating Reproducibility of LLMs in Data Science

Qiuhai Zeng, Claire Jin, Xinyue Wang, Yuhan Zheng, Qunhua Li

TL;DR

AIRepr introduces an Analyst-Inspector framework to rigorously evaluate reproducibility of LLM-generated data analyses by testing whether an independent inspector can reproduce the analyst's workflow and conclusions from the workflow alone. The approach formalizes reproducibility with sufficiency and completeness concepts and proposes RoT and RReflexion prompting to enhance workflow clarity and fidelity. Across 1,032 tasks from three benchmarks and 15 analyst-inspector pairs, the study shows that higher workflow reproducibility correlates with improved accuracy and that reproducibility-focused prompts boost both metrics, with RoT and RReflexion delivering substantial gains. The framework proves robust to inspector choice and supports scalable, transparent human-AI collaboration in data science; the authors also release code publicly to facilitate adoption and further research.

Abstract

Large language models (LLMs) are increasingly used to automate data analysis through executable code generation. Yet data science tasks often admit multiple statistically valid solutions, e.g., different modeling strategies, making it critical to understand the reasoning behind analyses, not just their outcomes. While manual review of LLM-generated code can help ensure statistical soundness, it is labor-intensive and requires expertise. A more scalable approach is to evaluate the underlying workflows, i.e., the logical plans guiding code generation. However, it remains unclear how to assess whether an LLM-generated workflow supports reproducible implementations. To address this, we present AIRepr, an Analyst-Inspector framework for automatically evaluating and improving the reproducibility of LLM-generated data analysis workflows. Our framework is grounded in statistical principles and supports scalable, automated assessment. We introduce two novel reproducibility-enhancing prompting strategies and benchmark them against standard prompting across 15 analyst-inspector LLM pairs and 1,032 tasks from three public benchmarks. Our findings show that workflows with higher reproducibility also yield more accurate analyses, and that reproducibility-enhancing prompts substantially improve both metrics. This work provides a foundation for transparent, reliable, and efficient human-AI collaboration in data science. Our code is publicly available.
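
The analyst-inspector check described above can be illustrated with a minimal Python sketch. This is not the authors' released code: the names `AnalystOutput`, `reproducibility`, `toy_analyst`, `faithful_inspector`, `divergent_inspector`, and `exact_match` are hypothetical, the LLM calls are replaced by fixed stubs, and conclusion matching is reduced to a case-insensitive string comparison. Dataset access is also omitted, so the inspector here receives the workflow text only.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class AnalystOutput:
    workflow: str    # analysis steps written by the analyst
    conclusion: str  # conclusion reached by running the analyst's own code


def reproducibility(
    task: str,
    analyst: Callable[[str], AnalystOutput],
    inspector: Callable[[str], str],
    same_conclusion: Callable[[str, str], bool],
) -> int:
    """Return R = 1 if the inspector reproduces the analyst's conclusion, else 0."""
    analyst_out = analyst(task)
    # The inspector is given only the workflow, never the analyst's code.
    inspector_conclusion = inspector(analyst_out.workflow)
    return int(same_conclusion(analyst_out.conclusion, inspector_conclusion))


# Toy stand-ins for LLM calls (hypothetical, for illustration only).
def toy_analyst(task: str) -> AnalystOutput:
    return AnalystOutput(
        workflow="Run Welch's t-test on groups A and B; reject H0 if p < 0.05.",
        conclusion="reject H0",
    )


def faithful_inspector(workflow: str) -> str:
    return "reject H0"


def divergent_inspector(workflow: str) -> str:
    return "fail to reject H0"


def exact_match(a: str, b: str) -> bool:
    # Assumed matching rule: case-insensitive equality after trimming.
    return a.strip().lower() == b.strip().lower()


if __name__ == "__main__":
    task = "Do groups A and B differ?"
    print(reproducibility(task, toy_analyst, faithful_inspector, exact_match))
    print(reproducibility(task, toy_analyst, divergent_inspector, exact_match))
```

In this sketch a workflow counts as reproducible only when an independent implementation of it arrives at the same conclusion. A real setup would replace the stubs with model calls and use a more tolerant comparison of conclusions.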

Paper Structure

This paper contains 38 sections, 4 equations, 14 figures, and 23 tables.

Figures (14)

  • Figure 1: A human-in-the-loop pipeline for AI-generated data analysis. Given a data analysis task, an AI Analyst generates a workflow (analysis steps), code, and conclusion. An independent AI Inspector takes the workflow to generate new code and a new conclusion. If the AI Analyst's conclusion is independently reproducible—meaning the workflow provides complete and sufficient details for the generated code—then human analysts can focus on evaluating the soundness of the workflow without manually verifying the code. However, if the AI solution fails the reproducibility check, the solution needs to be revised before being submitted for human review.
  • Figure 2: Analyst-Inspector framework for assessing LLM data analysis reproducibility.
  • Figure 3: Accuracy comparison between reproducible ($R=1$) and irreproducible ($R=0$) solutions across LLMs and datasets using CoT prompting. Non-executables are excluded from the $R=0$ group. The x- and y-axes show the proportion of accurate solutions in each group. The diagonal line indicates equal accuracy between the two groups. A one-sided t-test evaluates whether reproducible solutions are significantly more accurate.
  • Figure 4: Accuracy and reproducibility of LLMs across datasets for different prompting strategies. Solid line: the average metric scores calculated from different LLMs for each prompting strategy.
  • Figure 5: Accuracy and reproducibility of LLMs across different statistical question categories in StatQA.
  • ...and 9 more figures

Theorems & Definitions (1)

  • Definition