The Story is Not the Science: Execution-Grounded Evaluation of Mechanistic Interpretability Research

Xiaoyan Bai, Alexander Baumgartner, Haojia Sun, Ari Holtzman, Chenhao Tan

TL;DR

This work proposes the first execution-grounded evaluation framework, which verifies research beyond narrative review by examining code and data alongside the paper. It also develops MechEvalAgent, an automated evaluation framework that assesses the coherence of the experimental process, the reproducibility of results, and the generalizability of findings.

Abstract

Reproducibility crises across sciences highlight the limitations of the paper-centric review system in assessing the rigor and reproducibility of research. AI agents that autonomously design and generate large volumes of research outputs exacerbate these challenges. In this work, we address the growing challenges of scalability and rigor by flipping the dynamic and developing AI agents as research evaluators. We propose the first execution-grounded evaluation framework that verifies research beyond narrative review by examining code and data alongside the paper. We use mechanistic interpretability research as a testbed, build standardized research output, and develop MechEvalAgent, an automated evaluation framework that assesses the coherence of the experimental process, the reproducibility of results, and the generalizability of findings. We show that our framework achieves above 80% agreement with human judges, identifies substantial methodological problems, and surfaces 51 additional issues that human reviewers miss. Our work demonstrates the potential of AI agents to transform research evaluation and pave the way for rigorous scientific practices.

Paper Structure (12 sections, 15 figures, 4 tables)

Figures (15)

  • Figure 1: (a) Execution-grounded evaluation uncovers failures that narrative-only review misses. In this example, Failures 2, 3, and 4 require execution beyond narrative review. (b) As a highlight of our results, we find that MechEvalAgent surfaces 51 additional issues that human reviewers overlooked.
  • Figure 2: Overview of the MechEvalAgent framework. Research outputs are evaluated on coherence, reproducibility, and generalization, with each sub-dimension handled by an agent that takes in the relevant inputs.
  • Figure 3: Percentage of projects with at least one failure per dimension. Over 90% of tasks fail in reproducibility, and 80% fail in coherence.
  • Figure 4: Human-rated quality on MechEvalAgent evaluations (1-5 Likert scale, 1 = Strongly Disagree, 5 = Strongly Agree). All dimensions show ratings above 4.7, indicating high quality of agent assessments.
  • Figure 5: Failure breakdown comparing human-identified and agent-identified issues. MechEvalAgent surfaces more unique issues in all three dimensions.
  • ...and 10 more figures