AblationBench: Evaluating Automated Planning of Ablations in Empirical AI Research

Talor Abramovich, Gal Chechik

TL;DR

AblationBench is a benchmark for evaluating automated ablation planning in empirical AI research. It introduces two complementary tasks: AuthorAblation (planning ablations from a paper's method section) and ReviewerAblation (identifying missing ablations in full submissions). It pairs LM-based judges with ground-truth ablations to enable fully automated evaluation, and compares two planning paradigms (LM-based and agent-based, using SWE-agent) across diverse models and conferences. Key findings: frontier LMs identify only a minority of the original ablations (best ~38% recall), humans outperform models, and performance trends inversely between the author and reviewer tasks, which the authors attribute to differences in model grounding. The work highlights substantial room for improvement in automated ablation planning and releases its data and code publicly.

Abstract

Language model agents are increasingly used to automate scientific research, yet evaluating their scientific contributions remains a challenge. A key mechanism to obtain such insights is through ablation experiments. To this end, we introduce AblationBench, a benchmark suite for evaluating agents on ablation planning tasks in empirical AI research. It includes two tasks: AuthorAblation, which helps authors propose ablation experiments based on a method section and contains 83 instances, and ReviewerAblation, which helps reviewers find missing ablations in a full paper and contains 350 instances. For both tasks, we develop LM-based judges that serve as an automatic evaluation framework. Our experiments with frontier LMs show that these tasks remain challenging, with the best-performing LM system identifying only 38% of the original ablations on average, below human-level performance. We observe an inverse performance trend between the author and reviewer tasks, which we attribute to differences in model grounding. Lastly, we analyze the limitations of current LMs on these tasks, and find that chain-of-thought prompting outperforms an agent-based approach. Our data is available at https://huggingface.co/collections/ai-coscientist/ablationbench, and our code is available at https://github.com/ai-scientist-bench/ablation-bench.

Paper Structure

This paper contains 81 sections, 46 figures, 12 tables.

Figures (46)

  • Figure 1: Overview of AblationBench. (Top) AuthorAblation is a collection of conference papers where the task of the planner is to generate an ablation plan based on the paper's method section. The judge evaluates the generated plan using manually extracted gold ablations from the paper. (Bottom) ReviewerAblation is a collection of ICLR submissions where the task of the planner is to generate missing ablations based on the full submission. The judge evaluates the generated ablations using the official reviews of the submission.
  • Figure 2: Conference distribution of AuthorAblation test set.
  • Figure 3: Recall@5 performance for AuthorAblation of each planner approach across different LMs, separated by ablation type.
  • Figure 4: Qualitative analysis highlighting differences in the reasoning and grounding of Llama 3.1 405B and Claude 3.5 Sonnet on a ReviewerAblation instance.
  • Figure 5: An example of the matching criteria in an AuthorAblation instance, showing alignment between GT (gold, top) and generated (blue, bottom) ablations. Matching is done semantically and is represented as a bipartite graph: nodes correspond to GT and generated ablations, and edges indicate valid matches. Evaluation metrics are computed with respect to the GT ablations; in this example, both precision and recall equal 0.75.
  • ...and 41 more figures
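
The Figure 5 caption describes matching as a bipartite graph between GT and generated ablations, with precision and recall computed from it. The sketch below shows one way such metrics could be derived from a set of match edges. It is a minimal illustration, not the benchmark's actual code: the function name `precision_recall`, the index-pair edge format, and the exact metric definitions (recall as the fraction of GT ablations with at least one match, precision as the fraction of generated ablations with at least one match) are all assumptions. The example edges are likewise hypothetical, chosen only so that both values come out to 0.75 as in the caption.

```python
from typing import Iterable, Tuple


def precision_recall(
    num_gt: int,
    num_generated: int,
    edges: Iterable[Tuple[int, int]],
) -> Tuple[float, float]:
    """Compute precision and recall from bipartite match edges.

    Each edge is a (gt_index, generated_index) pair that a judge marked
    as a valid semantic match. These definitions are assumed for
    illustration and may differ from the benchmark's own.
    """
    edge_list = list(edges)
    matched_gt = {gt for gt, _ in edge_list}          # GT nodes with >= 1 edge
    matched_generated = {gen for _, gen in edge_list}  # generated nodes with >= 1 edge

    precision = len(matched_generated) / num_generated if num_generated else 0.0
    recall = len(matched_gt) / num_gt if num_gt else 0.0
    return precision, recall


# Hypothetical instance: 4 GT ablations, 4 generated ablations, 3 matches.
example_edges = [(0, 0), (1, 2), (2, 3)]
precision, recall = precision_recall(4, 4, example_edges)
print(precision, recall)  # 0.75 0.75
```

Counting matched nodes rather than edges keeps both values at or below 1 even when one generated ablation matches several GT ablations, or the reverse.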