SciReplicate-Bench: Benchmarking LLMs in Agent-driven Algorithmic Reproduction from Research Papers

Yanzheng Xiang, Hanqi Yan, Shuyin Ouyang, Lin Gui, Yulan He

TL;DR

SciReplicate-Bench benchmarks LLMs on reproducing algorithms described in recent NLP papers, a task that couples algorithm understanding with code generation. The accompanying dual-agent framework, Sci-Reproducer, splits the work between a Paper Agent and a Code Agent. Comprehension is scored with reasoning graph accuracy, and implementation quality with execution accuracy, CodeBLEU, and dependency/API recall. The task is highly challenging: the best configuration reaches only about 0.39 execution accuracy, and missing or inconsistent algorithm descriptions in the papers are major bottlenecks. The benchmark and associated tooling are released to support reproducible evaluation of AI-assisted scientific coding and automated verification.

Abstract

This study evaluates large language models (LLMs) in generating code from algorithm descriptions in recent NLP papers. The task requires two key competencies: (1) algorithm comprehension: synthesizing information from papers and academic literature to understand implementation logic, and (2) coding expertise: identifying dependencies and correctly implementing necessary APIs. To facilitate rigorous evaluation, we introduce SciReplicate-Bench, a benchmark of 100 tasks from 36 NLP papers published in 2024, featuring detailed annotations and comprehensive test cases. Building on SciReplicate-Bench, we propose Sci-Reproducer, a dual-agent framework consisting of a Paper Agent that interprets algorithmic concepts from literature and a Code Agent that retrieves dependencies from repositories and implements solutions. To assess algorithm understanding, we introduce reasoning graph accuracy, which quantifies similarity between generated and reference reasoning graphs derived from code comments and structure. For evaluating implementation quality, we employ execution accuracy, CodeBLEU, and repository dependency/API recall metrics. In our experiments, we evaluate various powerful non-reasoning and reasoning LLMs as foundation models. The best-performing LLM using Sci-Reproducer achieves only 39% execution accuracy, highlighting the benchmark's difficulty. Our analysis identifies missing or inconsistent algorithm descriptions as key barriers to successful reproduction. We make available our benchmark and code at https://github.com/xyzCS/SciReplicate-Bench and project homepage at https://xyzcs.github.io/scireplicate.github.io/.
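
To make two of the implementation metrics above concrete, here is a minimal Python sketch. It assumes execution accuracy is the fraction of tasks whose generated code passes all of its test cases, and that dependency/API recall is the share of reference dependencies that also appear in the generated code. The function names, the empty-reference convention, and the sample identifiers are hypothetical illustrations, not the benchmark's actual implementation.

```python
from typing import Iterable, Sequence


def execution_accuracy(passed: Sequence[bool]) -> float:
    """Fraction of tasks whose generated code passed all of its test cases."""
    if not passed:
        return 0.0
    return sum(passed) / len(passed)


def recall(reference: Iterable[str], generated: Iterable[str]) -> float:
    """Share of reference dependencies (or API calls) also used by the generated code."""
    ref = set(reference)
    if not ref:
        # Assumption: with nothing to recall, treat the task as fully recalled.
        return 1.0
    return len(ref & set(generated)) / len(ref)


# Hypothetical per-task outcomes: True means every test case passed.
results = [True, False, False, True, False]
print(execution_accuracy(results))  # 0.4

# Hypothetical dependency names, for illustration only.
reference_deps = {"torch.softmax", "utils.load_batch"}
generated_deps = {"torch.softmax", "torch.argmax"}
print(recall(reference_deps, generated_deps))  # 0.5
```

Under these assumptions, the reported 39% execution accuracy would correspond to 39 of the 100 tasks passing every test case.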

Paper Structure

This paper contains 38 sections, 1 equation, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Overview of the task and the proposed Sci-Reproducer framework. The task involves algorithm understanding and code implementation, handled by a Paper Agent and a Code Agent operating in separate contexts with specialized actions.
  • Figure 2: A grouped bar chart illustrating the frequency of tool usage by different models. The x-axis represents various actions, while the y-axis indicates the total number of times each tool was used on this dataset.
  • Figure A1: The categories of the tasks within SciReplicate-Bench.
  • Figure A2: The task consists of two steps: Algorithm Understanding and Code Implementation. (Left) The model must extract an algorithm’s workflow and details from the research paper, including descriptions and variable values from cited papers and other paper sections. (Right) Using this extracted information, the model implements the corresponding function in the code repository, correctly handling dependencies and API calls.
  • Figure A3: Overview of SciReplicate-Bench.
  • ...and 5 more figures