MURPHY: Multi-Turn GRPO for Self Correcting Code Generation

Chanakya Ekbote, Vijay Lingam, Behrooz Omidvar-Tehrani, Jun Huan, Sujay Sanghavi, Anoop Deoras, Stefano Soatto

TL;DR

This paper tackles the limitation of single-turn GRPO in RLVR for code generation by introducing Murphy, a multi-turn rollout framework that conditions optimization on intermediate, feedback-driven prompts and propagates rewards backward. Murphy combines two credit-assignment strategies (MaRS and MeRS) with pruning methods (IntraP and InterP) to manage computational cost while leveraging both quantitative and qualitative feedback from environment executions. Empirical results on Qwen3 and OLMo models trained with KodCode show Murphy achieving up to 8% relative gains in pass@1 over GRPO across multiple benchmarks, demonstrating improved reasoning refinement and self-correction. The work highlights the practical value of training-time, feedback-grounded optimization for agentic code generation and points to future directions in adaptive rollouts and broader reasoning domains.

Abstract

Reinforcement Learning with Verifiable Rewards (RLVR) has emerged as a powerful framework for enhancing the reasoning capabilities of large language models (LLMs). However, existing approaches such as Group Relative Policy Optimization (GRPO) and its variants, while effective on reasoning benchmarks, struggle with agentic tasks that require iterative decision-making. We introduce Murphy, a multi-turn reflective optimization framework that extends GRPO by incorporating iterative self-correction during training. By leveraging both quantitative and qualitative execution feedback, Murphy enables models to progressively refine their reasoning across multiple turns. Evaluations on code generation benchmarks with model families such as Qwen and OLMo show that Murphy consistently improves performance, achieving up to an 8% relative gain in pass@1 over GRPO under similar compute budgets.

Paper Structure

This paper contains 32 sections, 12 equations, 2 figures, 6 tables.

Figures (2)

  • Figure 1: Percentage change in coding problems solved by models trained with Murphy and GRPO over the base model across three models and datasets. Murphy-trained models solve up to $4.2\%$ more problems than GRPO. See \ref{tab:main_table} for details.
  • Figure 2: Overview of Murphy. Given an input prompt (q), $G$ code generations (o) are sampled and evaluated using a reward function (r). For each generation that does not achieve the maximum reward, the original prompt is combined with the failed output and the executor feedback (f), and the model is re-prompted to generate another $G$ candidates. This iterative process continues for a fixed number of turns, with rewards from later turns propagated backward. The example illustrates the case with $G=2$, where $G$ represents the number of rollouts per prompt, and $\rho(\cdot)$ denotes the credit assignment strategy.
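
The rollout described in the Figure 2 caption can be sketched as a small tree search. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Node`, `rollout`, `propagate`, and `rho_max`, the prompt template, and the reward scale (`MAX_REWARD = 1.0`) are all hypothetical. In particular, `rho_max` is only a placeholder for $\rho(\cdot)$; the definitions of MaRS and MeRS are not given in this section, so the sketch does not claim to reproduce either. The toy `generate` and `evaluate` functions stand in for the policy model and the code executor.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

MAX_REWARD = 1.0  # assumption: a reward of 1.0 means every test passed


@dataclass
class Node:
    prompt: str          # prompt this generation was sampled from
    output: str          # generated code (o)
    reward: float        # reward from the executor at this turn (r)
    children: List["Node"] = field(default_factory=list)
    credited: float = 0.0  # reward after backward propagation


def rollout(prompt: str,
            generate: Callable[[str], str],
            evaluate: Callable[[str], Tuple[float, str]],
            G: int, turns: int) -> List[Node]:
    """Sample G generations; re-prompt each failed one for the remaining turns."""
    nodes = []
    for _ in range(G):
        output = generate(prompt)
        reward, feedback = evaluate(output)
        node = Node(prompt, output, reward)
        if reward < MAX_REWARD and turns > 1:
            # Hypothetical template: original prompt + failed output + feedback (f).
            revised = f"{prompt}\n[failed attempt]\n{output}\n[feedback]\n{feedback}"
            node.children = rollout(revised, generate, evaluate, G, turns - 1)
        nodes.append(node)
    return nodes


def propagate(nodes: List[Node],
              rho: Callable[[float, List[float]], float]) -> None:
    """Push rewards from later turns back to earlier ones using rho."""
    for node in nodes:
        if node.children:
            propagate(node.children, rho)
            node.credited = rho(node.reward, [c.credited for c in node.children])
        else:
            node.credited = node.reward


def rho_max(own: float, later: List[float]) -> float:
    # Placeholder credit assignment: best reward reachable from this node.
    return max(own, max(later))


# Toy stand-ins for the policy and the executor (assumptions for the demo).
def toy_generate(prompt: str) -> str:
    return "fixed" if "[feedback]" in prompt else "buggy"


def toy_evaluate(output: str) -> Tuple[float, str]:
    if output == "fixed":
        return 1.0, ""
    return 0.0, "AssertionError: expected 3, got 2"


tree = rollout("Write add(a, b).", toy_generate, toy_evaluate, G=2, turns=2)
propagate(tree, rho_max)
```

With $G=2$ and two turns, as in the figure, both first-turn generations fail, each spawns two second-turn candidates, and the propagated value of each first-turn node reflects the success of its revisions. Because every failed generation branches into $G$ new candidates, the tree grows roughly as $G^{\text{turns}}$ in the worst case, which is the cost the pruning methods named in the TL;DR are meant to contain.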