MURPHY: Multi-Turn GRPO for Self Correcting Code Generation
Chanakya Ekbote, Vijay Lingam, Behrooz Omidvar-Tehrani, Jun Huan, Sujay Sanghavi, Anoop Deoras, Stefano Soatto
TL;DR
This paper tackles the limitation of single-turn GRPO in Reinforcement Learning with Verifiable Rewards (RLVR) for code generation by introducing Murphy, a multi-turn rollout framework that conditions optimization on intermediate, feedback-driven prompts and propagates rewards backward through the rollout. Murphy combines two credit-assignment strategies (MaRS and MeRS) with two pruning methods (IntraP and InterP) to manage computational cost while using both quantitative and qualitative feedback from environment executions. Empirical results on Qwen3 and OLMo models trained on KodCode show Murphy achieving up to 8% relative gains in pass@1 over GRPO across multiple benchmarks, indicating improved reasoning refinement and self-correction. The work highlights the practical value of training-time, feedback-grounded optimization for agentic code generation and points to future directions in adaptive rollouts and broader reasoning domains.
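To make "propagates rewards backward" concrete, here is a minimal sketch of one possible reading, not the paper's implementation. It assumes a rollout is a tree in which each failed attempt spawns feedback-conditioned retries, and that an attempt's credited reward is an aggregate of its own reward and its retries' credited rewards. The names `Turn` and `propagate` are hypothetical, and the max and mean aggregators are illustrative stand-ins that are not claimed to match the paper's MaRS and MeRS definitions.

```python
from dataclasses import dataclass, field
from statistics import mean
from typing import Callable, List


@dataclass
class Turn:
    """One attempt in a hypothetical multi-turn rollout tree."""
    reward: float  # verifiable reward, e.g. fraction of unit tests passed
    children: List["Turn"] = field(default_factory=list)  # feedback-conditioned retries


def propagate(node: Turn, aggregate: Callable[[List[float]], float]) -> float:
    """Return the reward credited to `node` after folding in its retries.

    Assumption: credit = aggregate(own reward, credited rewards of children).
    A leaf (no retries) is simply credited with its own reward.
    """
    if not node.children:
        return node.reward
    child_credit = [propagate(child, aggregate) for child in node.children]
    return aggregate([node.reward] + child_credit)


# A failed first attempt (reward 0.0) followed by two retries.
root = Turn(0.0, [Turn(1.0), Turn(0.5)])
max_credit = propagate(root, max)    # credit the best outcome reachable from this turn
mean_credit = propagate(root, mean)  # credit the average outcome over this turn and its retries
```

Under this reading, a first attempt that fails but leads to a successful correction still receives a learning signal, which is the intuition behind optimizing over multi-turn rollouts rather than single responses.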
Abstract
Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful framework for enhancing the reasoning capabilities of large language models (LLMs). However, existing approaches such as Group Relative Policy Optimization (GRPO) and its variants, while effective on reasoning benchmarks, struggle with agentic tasks that require iterative decision-making. We introduce Murphy, a multi-turn reflective optimization framework that extends GRPO by incorporating iterative self-correction during training. By leveraging both quantitative and qualitative execution feedback, Murphy enables models to progressively refine their reasoning across multiple turns. Evaluations on code generation benchmarks with model families such as Qwen and OLMo show that Murphy consistently improves performance, achieving up to an 8% relative gain in pass@1 over GRPO under similar compute budgets.
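For readers unfamiliar with the baseline that Murphy extends, the following is a minimal sketch of the group-relative advantage commonly used in GRPO: each sampled response is scored against the mean and standard deviation of the rewards in its own group. The function name, the use of the population standard deviation, and the `eps` stabilizer are assumptions made for illustration; actual implementations differ in these details.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize each reward against its group's mean and standard deviation.

    `rewards` holds the verifiable rewards of all responses sampled for one
    prompt. `eps` (an assumed stabilizer) avoids division by zero when every
    response in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four responses to one prompt: two pass the tests, two fail.
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

In this example the passing responses receive positive advantages and the failing ones negative advantages of equal magnitude. When all responses in a group earn the same reward, every advantage is zero and the group contributes no gradient signal.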
