AblationBench: Evaluating Automated Planning of Ablations in Empirical AI Research
Talor Abramovich, Gal Chechik
TL;DR
AblationBench is a benchmark for evaluating automated ablation planning in empirical AI research. It introduces two complementary tasks: AuthorAblation (planning ablations from a paper's method section) and ReviewerAblation (identifying missing ablations in a full submission). LM-based judges compare generated plans against ground-truth ablations, which makes evaluation fully automated, and the benchmark compares two planning approaches (an LM-based planner and an agent-based planner using SWE-agent) across diverse models and conferences. Frontier LMs recover only a minority of the original ablations (about 38% on average for the best system), which is below human-level performance, and performance trends inversely between the author and reviewer tasks, which the authors attribute to differences in model grounding. The results indicate substantial room for improvement in automated ablation planning, and the data and code are publicly available.
Abstract
Language model agents are increasingly used to automate scientific research, yet evaluating their scientific contributions remains a challenge. Ablation experiments are a key mechanism for obtaining such insights. To this end, we introduce AblationBench, a benchmark suite for evaluating agents on ablation planning tasks in empirical AI research. It includes two tasks: AuthorAblation, which helps authors propose ablation experiments based on a method section and contains 83 instances, and ReviewerAblation, which helps reviewers find missing ablations in a full paper and contains 350 instances. For both tasks, we develop LM-based judges that serve as an automatic evaluation framework. Our experiments with frontier LMs show that these tasks remain challenging, with the best-performing LM system identifying only 38% of the original ablations on average, below human-level performance. We observe an inverse performance trend between the author and reviewer tasks, which we attribute to differences in model grounding. Lastly, we analyze the limitations of current LMs on these tasks, and find that chain-of-thought prompting outperforms an agent-based approach. Our data is available at https://huggingface.co/collections/ai-coscientist/ablationbench, and our code is available at https://github.com/ai-scientist-bench/ablation-bench.
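To make the "38% of the original ablations on average" figure concrete, the following is a minimal sketch of how a per-paper recall could be averaged across papers. It assumes the judge's output has already been reduced to the set of ground-truth ablations it marked as recovered; the function names and the toy data are hypothetical illustrations, not the benchmark's actual code or results, and the paper's exact metric definition may differ.

```python
from statistics import mean


def recall(ground_truth: list[str], matched: set[str]) -> float:
    """Fraction of a paper's original ablations that were marked as recovered.

    `matched` stands in for the judge's decisions (an assumption of this sketch).
    """
    if not ground_truth:
        raise ValueError("paper has no ground-truth ablations")
    return sum(1 for ablation in ground_truth if ablation in matched) / len(ground_truth)


def mean_recall(papers: list[tuple[list[str], set[str]]]) -> float:
    """Average the per-paper recall over all papers (macro average)."""
    return mean(recall(ground_truth, matched) for ground_truth, matched in papers)


# Toy data, invented for illustration only.
papers = [
    # 1 of 4 original ablations recovered -> recall 0.25
    (["remove attention", "no pretraining", "smaller encoder", "no augmentation"],
     {"remove attention"}),
    # 1 of 2 original ablations recovered -> recall 0.5
    (["drop loss term", "freeze backbone"],
     {"drop loss term"}),
]

score = mean_recall(papers)
print(f"mean recall: {score:.3f}")
```

Averaging per paper, rather than pooling all ablations, weights every paper equally regardless of how many ablations it originally reported; this is one plausible reading of "on average", chosen here as an assumption.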
