Process Supervision-Guided Policy Optimization for Code Generation

Ning Dai, Zheng Wu, Renjie Zheng, Ziyun Wei, Wenlei Shi, Xing Jin, Guanlin Liu, Chen Dun, Liang Huang, Lin Yan

TL;DR

This work addresses the sparse rewards in RL-based code generation, which arrive only after complete code evaluation, by introducing a Process Reward Model (PRM) that provides dense, line-level feedback as code is generated. Process Supervision-Guided Policy Optimization (PSGPO) uses the PRM both as a source of dense rewards and as the initialization of the value function, enabling more efficient exploration and faster credit assignment. Empirical results on HumanEval, MBPP, and LiveCodeBench show consistent gains, especially for long-horizon tasks, with the best performance achieved when the PRM supplies both dense rewards and value initialization. The method relies on automated PRM data collection via binary-search labeling and on safeguards against reward hacking, and it offers practical guidelines for data selection and integration into RL pipelines, with potential applicability beyond code generation. By giving feedback on partially written code, the approach emulates the way human programmers iteratively refine their work.

Abstract

Reinforcement learning (RL) with unit test feedback has enhanced large language models' (LLMs) code generation, but relies on sparse rewards provided only after complete code evaluation, limiting learning efficiency and incremental improvements. When generated code fails all unit tests, no learning signal is received, hindering progress on complex tasks. To address this, we propose a Process Reward Model (PRM) that delivers dense, line-level feedback on code correctness during generation, mimicking human code refinement and providing immediate guidance. We explore various strategies for training PRMs and integrating them into the RL framework, finding that using PRMs both as dense rewards and for value function initialization significantly boosts performance. Our experimental results also highlight the effectiveness of PRMs in enhancing RL-driven code generation, especially for long-horizon scenarios.
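As an illustration of what "dense, line-level feedback" can mean for the RL reward signal, the following is a minimal sketch, not the authors' implementation. It assumes the PRM returns one score per generated line, that each score is credited to the last token of its line, and that the unit-test outcome is added at the final token. The function name `token_rewards`, the `prm_weight` coefficient, and its default value are hypothetical.

```python
from typing import List


def token_rewards(
    line_end_positions: List[int],
    prm_scores: List[float],
    num_tokens: int,
    unit_test_reward: float,
    prm_weight: float = 0.1,  # hypothetical scaling factor, not from the paper
) -> List[float]:
    """Build a per-token reward vector from line-level PRM scores.

    line_end_positions[i] is the index of the last token of line i,
    and prm_scores[i] is the (assumed) PRM score for that line.
    """
    if num_tokens <= 0:
        raise ValueError("num_tokens must be positive")
    if len(line_end_positions) != len(prm_scores):
        raise ValueError("need exactly one PRM score per line")

    rewards = [0.0] * num_tokens
    # Dense part: credit each line's PRM score to the token that ends the line.
    for pos, score in zip(line_end_positions, prm_scores):
        rewards[pos] += prm_weight * score
    # Sparse part: the unit-test outcome is only known once generation ends.
    rewards[-1] += unit_test_reward
    return rewards


if __name__ == "__main__":
    # Two lines ending at tokens 2 and 5; the response fails its unit tests.
    print(token_rewards([2, 5], [1.0, -1.0], 6, unit_test_reward=0.0, prm_weight=0.5))
```

In this sketch a response that fails every unit test still receives non-zero rewards at line boundaries, which is the situation the abstract identifies as yielding no learning signal under sparse rewards alone.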

Paper Structure

This paper contains 41 sections, 10 equations, 8 figures, 3 tables, 1 algorithm.

Figures (8)

  • Figure 1: Overview of our method. Our approach consists of two main components: (1) a binary search-based method for automating PRM training data labeling, used to train a code PRM; and (2) the integration of PRM into RL training, where it serves as (a) the initialization for the value model and (b) an evaluator assessing the correctness of each line of code, providing dense reward signals.
  • Figure 2: Binary search over code steps at line level to label prefixes. The first midpoint at $m=3$ is accepted, so the search interval moves to $[4,5]$. The next midpoint at $m=4$ is rejected, indicating errors occur after step 3.
  • Figure 3: Best-of-K performance curves for all RL training settings, showing the percentage of problems solved within $K$ generated responses.
  • Figure 4: Pass@1 difference between policies trained with and without PRM across varying response lengths. Policies trained with PRM exhibit consistent improvements over those without PRM for longer-horizon responses (greater than 100 tokens). This demonstrates PRM’s effectiveness in providing intermediate feedback, which enables RL to explore more effectively.
  • Figure 5: Pass@1 on LiveCodeBench as the average number of responses per prompt for PRM data collection increases (logarithmic scale). A value of $< 2^0$ indicates that we subsampled prompts from the full dataset, resulting in a smaller prompt set.
  • ...and 3 more figures
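
The binary search described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions rather than the authors' code: `prefix_ok` is a hypothetical oracle that decides whether a prefix of lines is accepted (how that decision is made is not specified in this summary), acceptance is assumed to be monotone in prefix length, and the +1/-1 label encoding is chosen only for illustration.

```python
from typing import Callable, List


def label_prefixes(
    steps: List[str],
    prefix_ok: Callable[[List[str]], bool],
) -> List[int]:
    """Label each line +1 (inside the longest accepted prefix) or -1 (after it).

    Assumes that if a prefix is rejected, every longer prefix is rejected too,
    so a binary search over prefix lengths finds the boundary.
    """
    lo, hi = 1, len(steps)  # candidate prefix lengths, inclusive
    last_good = 0           # length of the longest accepted prefix so far
    while lo <= hi:
        mid = (lo + hi) // 2
        if prefix_ok(steps[:mid]):
            last_good = mid
            lo = mid + 1    # accepted: look for a longer accepted prefix
        else:
            hi = mid - 1    # rejected: the first error is at or before step mid
    return [1 if i < last_good else -1 for i in range(len(steps))]


if __name__ == "__main__":
    probes: List[int] = []

    def toy_oracle(prefix: List[str]) -> bool:
        # Toy stand-in: prefixes of up to 3 lines are accepted.
        probes.append(len(prefix))
        return len(prefix) <= 3

    lines = ["line1", "line2", "line3", "line4", "line5"]
    print(label_prefixes(lines, toy_oracle))  # labels per line
    print(probes)                             # midpoints visited
```

With five steps and a toy oracle that accepts prefixes of up to three lines, the search probes $m=3$ and then $m=4$, matching the sequence in the caption, and needs only a logarithmic number of oracle calls in the number of lines.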