Process Supervision-Guided Policy Optimization for Code Generation
Ning Dai, Zheng Wu, Renjie Zheng, Ziyun Wei, Wenlei Shi, Xing Jin, Guanlin Liu, Chen Dun, Liang Huang, Lin Yan
TL;DR
This work addresses the sparse, end-of-generation rewards in RL-based code generation by introducing a Process Reward Model (PRM) that provides dense, line-level feedback while code is being generated. Process Supervision-Guided Policy Optimization (PSGPO) uses the PRM both as a source of dense rewards and to initialize the value function, enabling more efficient exploration and faster credit assignment. Results on HumanEval, MBPP, and LiveCodeBench show consistent gains, especially on long-horizon tasks, with the best performance when the PRM supplies both dense rewards and value initialization. PRM training data is collected automatically via binary-search labeling, with safeguards against reward hacking. The work also offers practical guidelines for data selection and integration into RL pipelines, with potential applicability beyond code generation. By giving feedback on partially written code, the approach emulates the iterative refinement of human programming workflows.
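The binary-search labeling mentioned above can be illustrated with a minimal sketch. It is not the authors' implementation: the oracle `can_complete` (for example, "some sampled completion of this prefix passes the unit tests"), the function names, the +1/-1 label values, and the monotonicity assumption (a completable prefix implies every shorter prefix is completable) are all hypothetical choices made for illustration.

```python
from typing import Callable, List


def longest_completable_prefix(
    lines: List[str],
    can_complete: Callable[[List[str]], bool],
) -> int:
    """Binary-search the largest k such that lines[:k] is still completable.

    Assumes the empty prefix is completable and that completability is
    monotone in prefix length (both are assumptions of this sketch).
    """
    lo, hi = 0, len(lines)  # invariant: lines[:lo] is completable
    while lo < hi:
        mid = (lo + hi + 1) // 2  # round up so the loop always makes progress
        if can_complete(lines[:mid]):
            lo = mid
        else:
            hi = mid - 1
    return lo


def label_lines(
    lines: List[str],
    can_complete: Callable[[List[str]], bool],
) -> List[int]:
    """Label each line +1 if it lies within the completable prefix, else -1."""
    good = longest_completable_prefix(lines, can_complete)
    return [1 if i < good else -1 for i in range(len(lines))]
```

Under these assumptions, the search needs only about log2(n) oracle calls for an n-line program instead of one call per line, which is what makes automated label collection affordable.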
Abstract
Reinforcement learning (RL) with unit test feedback has improved code generation by large language models (LLMs), but it relies on sparse rewards provided only after complete code evaluation, limiting learning efficiency and incremental improvements. When generated code fails all unit tests, no learning signal is received, hindering progress on complex tasks. To address this, we propose a Process Reward Model (PRM) that delivers dense, line-level feedback on code correctness during generation, mimicking human code refinement and providing immediate guidance. We explore various strategies for training PRMs and integrating them into the RL framework, finding that using PRMs both as dense rewards and for value function initialization significantly boosts performance. Our experimental results also highlight the effectiveness of PRMs in enhancing RL-driven code generation, especially for long-horizon scenarios.
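To make the contrast between sparse and dense rewards concrete, the following sketch shows one way line-level PRM scores could be combined with the terminal unit test reward. The additive form, the `prm_weight` coefficient, and the function name are assumptions made for illustration. The abstract does not specify how the two signals are combined.

```python
from typing import List


def shaped_rewards(
    prm_scores: List[float],
    unit_test_reward: float,
    prm_weight: float = 0.1,
) -> List[float]:
    """Return one reward per generated line.

    Each line receives a scaled PRM score (dense signal). The unit test
    reward (sparse signal) is added only at the final line.
    """
    if not prm_scores:
        raise ValueError("expected at least one generated line")
    rewards = [prm_weight * score for score in prm_scores]
    rewards[-1] += unit_test_reward  # terminal reward from unit tests
    return rewards


# A program that fails every test (terminal reward 0.0) still yields a
# nonzero per-line signal from the PRM scores.
per_line = shaped_rewards([1.0, 1.0, -1.0], unit_test_reward=0.0, prm_weight=0.5)
```

With the sparse reward alone, the failing program above would contribute no learning signal. With the dense term, the lines the PRM judges correct and incorrect are distinguished.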
