Enhancing Automated Paper Reproduction via Prompt-Free Collaborative Agents
Zijie Lin, Qilin Cai, Liang Shen, Mingjun Xiao
TL;DR
Existing frameworks for automated paper-to-code reproduction either lack systematic verification of the outputs at each generation step or depend on manually written refinement prompts. The authors introduce two prompt-free collaborative agents, a verification agent and a refinement agent, that rely solely on each step's original system prompt to validate and improve outputs, and integrate them into the Paper2Code workflow. On PaperBench Code-Dev and Paper2CodeBench, the approach improves the accuracy and completeness of reproduced code, with gains of approximately 15% and 13% over the baseline without the agents. Comparisons against Self-Refine support its robustness and consistency across datasets, and it reduces iteration costs relative to RePro. This work offers a practical, scalable direction for automated code generation in research pipelines.
Abstract
Automated paper reproduction has emerged as a promising approach to accelerate scientific research, employing multi-step workflow frameworks to systematically convert academic papers into executable code. However, existing frameworks often lack mechanisms to verify and refine the outputs at each generation step, or rely heavily on manually designed prompts for self-refinement, which limits their adaptability and scalability. To address these limitations, we propose a prompt-free collaborative agent framework that automatically enhances the quality of paper-to-code generation. Our approach employs two collaborative agents: a verification agent that examines whether the outputs at each step satisfy the requirements specified in the corresponding system prompt, and a refinement agent that revises the outputs based on the identified issues. Unlike previous methods that require human experts to craft specific refinement prompts for each step, our framework achieves automatic verification and improvement by leveraging only the original system prompts. We integrate our collaborative agents into the Paper2Code framework and conduct comprehensive experiments on PaperBench Code-Dev and Paper2CodeBench datasets. Experimental results demonstrate that our approach significantly improves the accuracy and completeness of reproduced code, achieving performance gains of approximately 15% and 13%, respectively, compared to the baseline without our agents. Furthermore, comparative experiments against Self-Refine validate the robustness and consistency of our prompt-free approach across different datasets.
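The verification and refinement loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the names `verify`, `refine`, `verify_and_refine`, `VERIFY_INSTRUCTION`, and the `max_rounds` parameter are hypothetical, the step-independent verifier instruction is an assumption about how "prompt-free" might be realized, and `fake_llm` is a deterministic stand-in for a real model call so that the example runs offline.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical interface: an LLM call taking (system_prompt, user_message).
LLM = Callable[[str, str], str]

# Assumption: one generic instruction reused for every step, so no
# step-specific refinement prompt has to be written by hand.
VERIFY_INSTRUCTION = (
    "List each requirement of the system prompt that the output fails to "
    "satisfy, one per line. Reply with NONE if all are satisfied."
)


@dataclass
class StepResult:
    output: str
    rounds: int  # number of refinement rounds actually applied


def verify(llm: LLM, system_prompt: str, output: str) -> List[str]:
    """Verification agent: check the output against the step's own system prompt."""
    reply = llm(
        VERIFY_INSTRUCTION,
        f"SYSTEM PROMPT:\n{system_prompt}\n\nOUTPUT:\n{output}",
    )
    lines = [ln.strip() for ln in reply.splitlines() if ln.strip()]
    return [] if lines == ["NONE"] else lines


def refine(llm: LLM, system_prompt: str, output: str, issues: List[str]) -> str:
    """Refinement agent: revise the output under the original system prompt."""
    message = (
        "Revise the previous result to resolve the listed issues.\n\n"
        f"OUTPUT:\n{output}\n\nISSUES:\n" + "\n".join(issues)
    )
    return llm(system_prompt, message)


def verify_and_refine(llm: LLM, system_prompt: str, output: str,
                      max_rounds: int = 2) -> StepResult:
    """Alternate verification and refinement until no issues remain."""
    rounds = 0
    while rounds < max_rounds:
        issues = verify(llm, system_prompt, output)
        if not issues:
            break
        output = refine(llm, system_prompt, output, issues)
        rounds += 1
    return StepResult(output, rounds)


def fake_llm(system: str, user: str) -> str:
    """Deterministic stand-in for a model, used only to make the sketch runnable."""
    candidate = user.split("OUTPUT:\n", 1)[1]
    if system == VERIFY_INSTRUCTION:
        has_list = "file list" in candidate.lower()
        return "NONE" if has_list else "Missing the required file list."
    # Refinement call: keep the draft and append the missing section.
    draft = candidate.split("\n\nISSUES:\n", 1)[0]
    return draft + "\nFile list: main.py"


step_prompt = "Write an implementation plan. It must include a file list."
result = verify_and_refine(fake_llm, step_prompt, "Plan: train the model.")
print(result.rounds, repr(result.output))
```

In this sketch the only step-specific input is the step's own system prompt, which serves both as the verifier's checklist and as the refiner's instruction. A real pipeline would replace `fake_llm` with an actual model call and apply `verify_and_refine` after each generation step of the workflow.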
