CORE-Bench: Fostering the Credibility of Published Research Through a Computational Reproducibility Agent Benchmark

Zachary S. Siegel, Sayash Kapoor, Nitya Nagdir, Benedikt Stroebl, Arvind Narayanan

TL;DR

CORE-Bench introduces a computational reproducibility benchmark comprising 270 tasks drawn from 90 CodeOcean capsules across three disciplines to evaluate AI agents’ ability to reproduce published results. Using an isolated-VM evaluation harness, the study compares AutoGPT and a task-tailored CORE-Agent across GPT-4o and GPT-4o-mini backends, revealing substantial room for improvement (best ~60% accuracy on easy tasks, dropping to ~21% on hard). The results highlight that task-specific prompting, model strength, and modality (text vs. vision) significantly influence performance, while retrieval and dependency-management challenges pose practical bottlenecks. The authors argue that improving reproducibility automation is a necessary step toward scalable, autonomous scientific inquiry and provide a reproducible evaluation framework to accelerate progress. Ultimately, CORE-Bench aims to drive development of safer, more capable agents that can verify and extend existing research work.

Abstract

AI agents have the potential to aid users on a variety of consequential tasks, including conducting scientific research. To spur the development of useful agents, we need benchmarks that are challenging, but more crucially, directly correspond to real-world tasks of interest. This paper introduces such a benchmark, designed to measure the accuracy of AI agents in tackling a crucial yet surprisingly challenging aspect of scientific research: computational reproducibility. This task, fundamental to the scientific process, involves reproducing the results of a study using the provided code and data. We introduce CORE-Bench (Computational Reproducibility Agent Benchmark), a benchmark consisting of 270 tasks based on 90 scientific papers across three disciplines (computer science, social science, and medicine). Tasks in CORE-Bench consist of three difficulty levels and include both language-only and vision-language tasks. We provide an evaluation system to measure the accuracy of agents in a fast and parallelizable way, saving days of evaluation time for each run compared to a sequential implementation. We evaluated two baseline agents: the general-purpose AutoGPT and a task-specific agent called CORE-Agent. We tested both variants using two underlying language models: GPT-4o and GPT-4o-mini. The best agent achieved an accuracy of 21% on the hardest task, showing the vast scope for improvement in automating routine scientific tasks. Having agents that can reproduce existing work is a necessary step towards building agents that can conduct novel research and could verify and improve the performance of other research agents. We hope that CORE-Bench can improve the state of reproducibility and spur the development of future research agents.

Paper Structure (37 sections, 11 figures, 8 tables)

This paper contains 37 sections, 11 figures, 8 tables.

Figures (11)

  • Figure 1: Overview of CORE-Bench. Each task in CORE-Bench requires an agent to reproduce the results of a research paper given its repository. The agent must install libraries, packages, and dependencies and run the code. If the code runs successfully, the agent needs to search through all outputs to answer the task questions. The agent submits a report and is evaluated against the results of a successful reproduction. An agent successfully completes a task if it correctly answers all questions about a code repository.
  • Figure 2: Files and folders in each CodeOcean capsule. Each capsule contains a Readme, Dockerfile, and instructions on how to use Docker, which we selectively provide to the agent depending on the difficulty of the task.
  • Figure 3: Capsule selection process. We filtered the 5,090 capsules on CodeOcean by discipline, language, and the ten selection criteria to arrive at the 90 capsules selected for CORE-Bench. We provide a breakdown of capsules by discipline in the appendix (overall statistics).
  • Figure 4: During task execution, the agent must interpret the task prompt, set up the code in the capsule, run the code, and populate the specified result in the provided JSON file. For evaluation, we manually reproduced each capsule in the benchmark three times. An agent correctly solves a task if its reported result for every task question falls within a 95% prediction interval computed from the three manual runs (although only 17 / 181 task questions have stochastic answers). Prediction intervals provide a range in which we expect future observations to fall, accounting for stochasticity in the code outputs (Spence and Stanley, 2016).
  • Figure 5: (1) The manager machine creates a VM for each (agent, task) pair and uploads both the capsule and the agent code to the VM. (2) The manager machine invokes the agent on each of the VMs, so they all run in parallel. (3) The manager machine downloads the results from the agent off each VM once the agent indicates task completion, deletes the VM, and locally evaluates all of the results.
  • ...and 6 more figures
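
The evaluation rule in the Figure 4 caption (every reported answer must fall within a 95% prediction interval built from three manual runs) can be illustrated with a minimal sketch. It assumes the standard Student-t prediction interval for a single future observation, mean ± t · s · sqrt(1 + 1/n); the benchmark's exact computation may differ. The function names, the question keys, and the small table of critical values are hypothetical and are not taken from the CORE-Bench code.

```python
import math
import statistics

# Two-sided 95% Student-t critical values, keyed by degrees of freedom (n - 1).
# Only a few entries are listed; three manual runs use df = 2.
T_CRIT_95 = {1: 12.706, 2: 4.303, 3: 3.182, 4: 2.776}


def prediction_interval(runs):
    """Return (low, high) bounds of a 95% prediction interval for one new observation."""
    n = len(runs)
    mean = statistics.mean(runs)
    sd = statistics.stdev(runs)  # sample standard deviation
    half_width = T_CRIT_95[n - 1] * sd * math.sqrt(1 + 1 / n)
    return mean - half_width, mean + half_width


def task_solved(reported, reference_runs):
    """A task counts as solved only if every question's answer lies in its interval."""
    for question, runs in reference_runs.items():
        if question not in reported:
            return False
        low, high = prediction_interval(runs)
        if not (low <= reported[question] <= high):
            return False
    return True


# Hypothetical task with one stochastic and one deterministic question.
reference = {"test_accuracy": [0.90, 0.92, 0.91], "num_clusters": [42, 42, 42]}
print(task_solved({"test_accuracy": 0.93, "num_clusters": 42}, reference))
print(task_solved({"test_accuracy": 0.97, "num_clusters": 42}, reference))
```

For a deterministic question the three runs are identical, so the standard deviation is zero and the interval collapses to a single value. Under this sketch such questions require an exact match, and only the stochastic ones get a tolerance band.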