RECODE-H: A Benchmark for Research Code Development with Interactive Human Feedback

Chunyu Miao, Henry Peng Zou, Yangning Li, Yankai Chen, Yibo Wang, Fangxin Wang, Yifan Li, Wooseong Yang, Bowei He, Xinni Zhang, Dianzhi Yu, Hanchen Yang, Hoang H Nguyen, Yue Zhou, Jie Yang, Jizhou Guo, Wenzhe Fan, Chin-Yuan Yeh, Panpan Meng, Liancheng Fang, Jinhu Qi, Wei-Chieh Huang, Zhengyao Gu, Yuwei Han, Langzhou He, Yuyao Yang, Yinghui Li, Hai-Tao Zheng, Xue Liu, Irwin King, Philip S. Yu

TL;DR

RECODE-H addresses the gap in evaluating LLMs for interactive research code development by introducing a repository-level benchmark of 102 tasks drawn from real papers and codebases. It pairs structured instructions and unit tests with a five-level human-feedback hierarchy and the ReCodeAgent framework, which iteratively refines code across multiple turns. Experiments across seven state-of-the-art LLMs show that richer feedback substantially improves functional correctness, recall, and code similarity. Larger models benefit most, and the results also reveal feedback adoption dynamics and persistent semantic challenges. The work provides a foundation for developing adaptive, feedback-driven agents that can implement and refine scientific methods in software, and it guides future research on interactive code generation for science.

Abstract

Large language models (LLMs) show promise in supporting scientific research implementation, yet their ability to generate correct and executable code remains limited. Existing work largely adopts one-shot settings, ignoring the iterative, feedback-driven nature of realistic scientific research development workflows. To address this gap, we present RECODE-H, a benchmark of 102 tasks from research papers and repositories that evaluates LLM agents through multi-turn interactions with LLM-simulated human feedback. It includes structured instructions, unit tests, and a five-level feedback hierarchy to reflect realistic researcher-agent collaboration. We further present ReCodeAgent, a framework that integrates feedback into iterative code generation. Experiments with leading LLMs, including GPT-5, Claude-Sonnet-4, DeepSeek-V3.1, and Gemini 2.5, show substantial performance gains with richer feedback, while also highlighting ongoing challenges in generating complex research code. RECODE-H establishes a foundation for developing adaptive, feedback-driven LLM agents in scientific research implementation.
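The multi-turn interaction described in the abstract (generate code, run unit tests, receive feedback at a chosen level, refine) can be illustrated with a minimal sketch. This is not the ReCodeAgent implementation: every name here (`refine_with_feedback`, `run_tests`, `LoopResult`, the callable signatures, and the `max_turns` default) is a hypothetical stand-in chosen for illustration, and the code generator and feedback provider are passed in as plain callables rather than real LLM calls.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# Hypothetical type aliases; none of these names come from the RECODE-H release.
TestFn = Callable[[str], Tuple[bool, str]]         # code -> (passed, message)
GenerateFn = Callable[[str, List[str]], str]       # (instruction, feedback so far) -> code
FeedbackFn = Callable[[str, List[str], int], str]  # (code, failures, level) -> feedback


@dataclass
class LoopResult:
    code: str
    pass_rate: float
    turns_used: int
    history: List[str] = field(default_factory=list)


def run_tests(code: str, tests: List[TestFn]) -> Tuple[float, List[str]]:
    """Run every unit test; return the pass rate and the failure messages."""
    if not tests:
        raise ValueError("at least one unit test is required")
    failures = []
    for test in tests:
        passed, message = test(code)
        if not passed:
            failures.append(message)
    return (len(tests) - len(failures)) / len(tests), failures


def refine_with_feedback(
    instruction: str,
    generate: GenerateFn,
    give_feedback: FeedbackFn,
    tests: List[TestFn],
    feedback_level: int,
    max_turns: int = 10,  # assumed turn budget, not taken from the paper
) -> LoopResult:
    """Generate, test, and refine until all tests pass or the budget runs out."""
    if max_turns < 1:
        raise ValueError("max_turns must be at least 1")
    history: List[str] = []
    code, pass_rate = "", 0.0
    for turn in range(1, max_turns + 1):
        code = generate(instruction, history)
        pass_rate, failures = run_tests(code, tests)
        if not failures:
            return LoopResult(code, pass_rate, turn, history)
        # The feedback provider decides how much detail to reveal at this level.
        history.append(give_feedback(code, failures, feedback_level))
    return LoopResult(code, pass_rate, max_turns, history)
```

In this sketch the feedback level is only passed through to the feedback provider, so the same loop can be rerun at each level of the hierarchy to compare pass-rate trajectories across turns.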

Paper Structure

This paper contains 26 sections, 10 figures, 9 tables.

Figures (10)

  • Figure 1: Illustration of the RECODE-H workflow, where LLM agents iteratively generate, test, and refine research code through structured researcher feedback.
  • Figure 2: Pass rate trajectories across interaction turns under varying feedback levels. Richer feedback consistently boosts model performance, with the largest gains appearing in early turns. Stronger models like GPT-5 and DeepSeek-V3.1 adapt more effectively, while Gemini-2.5-flash and Claude-Sonnet-4 plateau earlier.
  • Figure 3: The domain of the tasks within RECODE-H.
  • Figure 4: Increase in the average pass rate of GPT-5 when guided by different feedback models across feedback levels. GPT-5 feedback yields the strongest improvements, particularly at Level 4.
  • Figure 5: Average pass rate of GPT-5-mini when guided by different feedback models across feedback levels. GPT-5 feedback yields the strongest improvements, particularly at Level 4.
  • ...and 5 more figures