Paper2Code: Automating Code Generation from Scientific Papers in Machine Learning

Minju Seo, Jinheon Baek, Seongyun Lee, Sung Ju Hwang

TL;DR

Machine learning papers are often published without accessible code, which hinders reproducibility. The authors introduce PaperCoder, a three-stage, multi-agent LLM framework that translates ML papers into faithful, repository-level code without relying on preexisting implementations. Evaluated on Paper2CodeBench and PaperBench, PaperCoder shows strong performance, high fidelity to author intent, and near-executable code that needs only minimal manual debugging. The work highlights the potential of structured LLM workflows to accelerate scientific progress and reproducibility, and it also offers insights into evaluation alignment and backbone model choices.

Abstract

Despite the rapid growth of machine learning research, corresponding code implementations are often unavailable, making it slow and labor-intensive for researchers to reproduce results and build upon prior work. In the meantime, recent Large Language Models (LLMs) excel at understanding scientific documents and generating high-quality code. Inspired by this, we introduce PaperCoder, a multi-agent LLM framework that transforms machine learning papers into functional code repositories. PaperCoder operates in three stages: planning, where it constructs a high-level roadmap, designs the system architecture with diagrams, identifies file dependencies, and generates configuration files; analysis, which focuses on interpreting implementation-specific details; and generation, where modular, dependency-aware code is produced. Moreover, each phase is instantiated through a set of specialized agents designed to collaborate effectively across the pipeline. We then evaluate PaperCoder on generating code implementations from machine learning papers based on both model-based and human evaluations, particularly from the authors of those papers, with author-released repositories as ground truth if available. Our results demonstrate the effectiveness of PaperCoder in creating high-quality, faithful implementations. Furthermore, it consistently shows strengths in the recently released PaperBench benchmark, surpassing strong baselines by substantial margins. Code is available at: https://github.com/going-doer/Paper2Code.
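
The three stages described in the abstract can be pictured as a small pipeline. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `LLM` callable, the `Plan` fields, the prompt wording, the function names, and the `config.yaml` file name are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Assumption: any function mapping a prompt string to a completion string.
LLM = Callable[[str], str]


@dataclass
class Plan:
    """Hypothetical container for the planning-stage outputs."""
    roadmap: str            # high-level implementation roadmap
    architecture: str       # system design (e.g. a textual diagram)
    file_order: List[str]   # files listed so dependencies come first
    config: str             # contents of a configuration file


def plan_stage(paper: str, llm: LLM) -> Plan:
    roadmap = llm(f"Write an implementation roadmap for this paper:\n{paper}")
    architecture = llm(f"Design the system architecture for:\n{roadmap}")
    listing = llm(f"List the files in dependency order, one per line:\n{architecture}")
    file_order = [line.strip() for line in listing.splitlines() if line.strip()]
    config = llm(f"Write a configuration file for:\n{roadmap}")
    return Plan(roadmap, architecture, file_order, config)


def analysis_stage(paper: str, plan: Plan, llm: LLM) -> Dict[str, str]:
    # One file-level specification per planned file.
    return {
        name: llm(f"Describe the implementation details of {name}.\nPaper:\n{paper}")
        for name in plan.file_order
    }


def generation_stage(plan: Plan, specs: Dict[str, str], llm: LLM) -> Dict[str, str]:
    repo: Dict[str, str] = {}
    for name in plan.file_order:
        # Dependency-aware: each file sees the files generated before it.
        context = "\n".join(f"# {done}\n{src}" for done, src in repo.items())
        repo[name] = llm(
            f"Write the file {name}.\nSpecification:\n{specs[name]}\n"
            f"Already generated files:\n{context}"
        )
    return repo


def paper_to_repo(paper: str, llm: LLM) -> Dict[str, str]:
    plan = plan_stage(paper, llm)
    specs = analysis_stage(paper, plan, llm)
    repo = generation_stage(plan, specs, llm)
    repo["config.yaml"] = plan.config  # hypothetical file name
    return repo
```

Calling `paper_to_repo(paper_text, llm)` with any prompt-to-text function returns a mapping from file names to generated sources. Generating files in planned order is what lets later files condition on earlier ones, which is one plausible reading of "dependency-aware" in the abstract.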

Paper Structure

This paper contains 50 sections, 1 equation, 39 figures, 21 tables.

Figures (39)

  • Figure 1: (a) PaperCoder, which aims to transform given scientific papers into code repositories, consisting of planning, analysis, and coding steps. (b) Code availability, where blue bars indicate the total number of accepted papers and orange regions show those with officially released code (see the appendix on code availability for calculation details).
  • Figure 2: (Left) The naive approach, which directly generates an entire code repository from a paper. (Right) Our PaperCoder framework, which is operationalized by decomposing the task into three stages: (1) Planning, where a high-level implementation plan is constructed from the paper, including overall plan, architectural design, logic design, and configuration file; (2) Analysis, where the plan is translated into detailed file-level specifications; and (3) Coding, where the final codes are generated to implement the methods and experiments of the paper.
  • Figure 3: Correlation between model-based evaluations: reference-based and reference-free.
  • Figure 3: PaperBench Code-Dev results. We report the averaged performance over three runs with standard deviations.
  • Figure 4: Model-based evaluation results by paper presentation types.
  • ...and 34 more figures