Process-Supervised Reinforcement Learning for Code Generation

Yufan Ye, Ting Zhang, Wenbin Jiang, Hua Huang

TL;DR

PRLCoder addresses the difficulty of training code-generation models from sparse outcome signals by introducing process-supervised reinforcement learning. It automates line-level process supervision via a mutation/refactoring-compile-execution strategy, trains a process-supervised reward model (PRM) on the resulting data, and compares it against an outcome-supervised reward model (ORM) within a PPO framework. Empirical evaluation on MBPP and HumanEval (MBPP+ augmentation) shows that PRLCoder yields substantial improvements, especially on complex tasks, and trains faster and more stably than outcome-supervised methods. The approach reduces manual labeling costs and provides fine-grained guidance to the generation process, with potential applicability to reasoning and planning tasks beyond code.
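To make the contrast between sparse outcome signals and line-level guidance concrete, here is a minimal sketch of how the two reward layouts might look over a generated token sequence. The function names, the choice to place each reward on the last token of a line, and the example scores are all illustrative assumptions; the summary above does not specify how PRLCoder maps rewards onto tokens.

```python
from typing import List


def outcome_rewards(line_lengths: List[int], outcome_score: float) -> List[float]:
    """Sparse layout (assumed): one reward on the final token of the program."""
    total = sum(line_lengths)
    rewards = [0.0] * total
    if total:
        rewards[-1] = outcome_score
    return rewards


def process_rewards(line_lengths: List[int], line_scores: List[float]) -> List[float]:
    """Dense layout (assumed): one reward on the last token of every line."""
    if len(line_lengths) != len(line_scores):
        raise ValueError("need exactly one score per line")
    rewards: List[float] = []
    for length, score in zip(line_lengths, line_scores):
        if length <= 0:
            raise ValueError("each line must contain at least one token")
        # Zero reward inside the line, the line's score on its last token.
        rewards.extend([0.0] * (length - 1) + [score])
    return rewards


# Hypothetical program of three lines with 4, 3, and 2 tokens.
lengths = [4, 3, 2]
sparse = outcome_rewards(lengths, outcome_score=1.0)
dense = process_rewards(lengths, line_scores=[1.0, -1.0, 1.0])
```

Under the sparse layout the policy receives a single signal for the whole program, whereas the dense layout marks the second line as the likely source of a problem, which is the kind of fine-grained guidance the TL;DR describes.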

Abstract

Existing reinforcement learning strategies based on outcome supervision have proven effective in enhancing the performance of large language models (LLMs) for code generation. While reinforcement learning based on process supervision has shown great promise in handling multi-step reasoning tasks, its effectiveness in code generation remains largely underexplored and underjustified. The primary obstacle is the resource-intensive nature of constructing high-quality process-supervised data, which demands substantial human expertise and computational resources. In response to this challenge, we propose a "statement mutation/refactoring-compile and execution verification" strategy: a teacher model mutates and refactors code line by line, and compilation and execution results are used to label each line automatically. This yields line-by-line process-supervised data, which is pivotal for training a process-supervised reward model. The trained reward model is then integrated into the PRLCoder framework and validated experimentally on several benchmarks. Experimental results demonstrate that process-supervised reinforcement learning significantly surpasses methods relying solely on outcome supervision. Notably, on complex code generation tasks, process-supervised reinforcement learning shows a clear advantage, ensuring both the integrity of the code generation process and the correctness of the generated results.
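The labeling strategy described above can be sketched roughly as follows. This is a simplified illustration, not the authors' pipeline: the teacher model is replaced by hand-written candidate lines, the label values (1 for a variant that still passes, 0 otherwise) are an assumption, and `label_variant` and `build_sample` are hypothetical helper names. Truncating the sample after the changed line mirrors the masking of subsequent statements mentioned in the Figure 2 caption.

```python
from typing import Any, Dict, List, Tuple

TestCase = Tuple[Tuple[Any, ...], Any]  # (arguments, expected result)


def label_variant(lines: List[str], index: int, new_line: str,
                  tests: List[TestCase], entry_point: str) -> int:
    """Swap in `new_line`, then compile and run the tests.

    Returns 1 if the program compiles and passes every test, else 0.
    Note: no timeout or sandbox here; real untrusted code needs both.
    """
    candidate = lines[:index] + [new_line] + lines[index + 1:]
    try:
        code = compile("\n".join(candidate), "<candidate>", "exec")
        namespace: Dict[str, Any] = {}
        exec(code, namespace)
        func = namespace[entry_point]
        for args, expected in tests:
            if func(*args) != expected:
                return 0
    except Exception:
        # Compilation errors and runtime errors both count as failures.
        return 0
    return 1


def build_sample(lines: List[str], index: int, new_line: str,
                 tests: List[TestCase], entry_point: str) -> Dict[str, Any]:
    """Create one line-level sample: code up to the changed line plus its label."""
    label = label_variant(lines, index, new_line, tests, entry_point)
    prefix = lines[:index] + [new_line]  # later statements are masked out
    return {"prefix": "\n".join(prefix), "label": label}


reference = [
    "def add(a, b):",
    "    total = a + b",
    "    return total",
]
tests = [((1, 2), 3), ((0, 0), 0)]

refactored = build_sample(reference, 1, "    total = b + a", tests, "add")
mutated = build_sample(reference, 1, "    total = a - b", tests, "add")
broken = build_sample(reference, 1, "    total = a +", tests, "add")
```

In this toy case the behavior-preserving refactoring keeps a positive label, while the semantic mutation and the line that no longer compiles are both labeled negative, so no human annotation is needed for any of the three samples.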

Paper Structure

This paper contains 22 sections, 3 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: The overall framework of PRLCoder, with a three-phase structure: supervised training, reward model training (including ORM for comparison), and reinforcement learning employing the trained reward model.
  • Figure 2: Schematic diagram of the method for automatically constructing the process-supervision reward dataset for code generation. The bolded portions represent code statements that have been mutated or refactored by DeepSeek-Coder-V2; the statements that follow them are masked.
  • Figure 3: Training of two types of ORM.
  • Figure 4: Quantitative analysis of the process-supervised reward model for code trained using our method.
  • Figure 5: The loss curves of the reinforcement learning under three different supervision methods.
  • ...and 4 more figures