Posterior-GRPO: Rewarding Reasoning Processes in Code Generation

Lishui Fan, Yu Zhang, Mouxiang Chen, Zhongxin Liu

TL;DR

This work targets a gap in reinforcement learning for code generation: rewards are typically computed from final outcomes alone, ignoring the quality of the intermediate reasoning. It introduces LCB-RB, a benchmark of preference pairs for discriminating reasoning quality; an Optimized-Degraded based (OD-based) method for training a reward model that scores reasoning along multiple quality dimensions; and Posterior-GRPO (P-GRPO), a reinforcement learning algorithm that adds the thinking reward to the outcome reward only when the final answer is correct, which mitigates reward hacking. A 7B reward model trained this way achieves state-of-the-art performance on LCB-RB, and a 7B policy trained with P-GRPO outperforms outcome-only baselines by 4.5% on code generation tasks and generalizes to mathematical tasks.

Abstract

Reinforcement learning (RL) has significantly advanced code generation for large language models (LLMs). However, current paradigms rely on outcome-based rewards from test cases, neglecting the quality of the intermediate reasoning process. While supervising the reasoning process directly is a promising direction, it is highly susceptible to reward hacking, where the policy model learns to exploit the reasoning reward signal without improving final outcomes. To address this, we introduce a unified framework that can effectively incorporate the quality of the reasoning process during RL. First, to enable reasoning evaluation, we develop LCB-RB, a benchmark comprising preference pairs of superior and inferior reasoning processes. Second, to accurately score reasoning quality, we introduce an Optimized-Degraded based (OD-based) method for reward model training. This method generates high-quality preference pairs by systematically optimizing and degrading initial reasoning paths along curated dimensions of reasoning quality, such as factual accuracy, logical rigor, and coherence. A 7B parameter reward model trained with this method achieves state-of-the-art (SOTA) performance on LCB-RB and generalizes well to other benchmarks. Finally, we introduce Posterior-GRPO (P-GRPO), a novel RL method that conditions process-based rewards on task success. By applying rewards only to the reasoning processes of successful outcomes, P-GRPO effectively mitigates reward hacking and aligns the model's internal reasoning with final code correctness. A 7B parameter model trained with P-GRPO achieves superior performance across diverse code generation tasks, outperforming outcome-only baselines by 4.5% and achieving performance comparable to GPT-4-Turbo. We further demonstrate the generalizability of our approach by extending it to mathematical tasks. Our models, dataset, and code are publicly available.
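
The posterior conditioning described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the `Rollout` type, the additive combination of the two rewards, the zero reward for failed rollouts, and the assumption that the thinking score lies in [0, 1] are all hypothetical choices made for illustration. The group-relative normalization follows the usual GRPO convention of standardizing rewards within a group of sampled responses.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    """One sampled response for a prompt (hypothetical container)."""
    passed_tests: bool     # outcome signal: did the generated code pass the test cases?
    thinking_score: float  # reward-model score for the reasoning, assumed to lie in [0, 1]


def posterior_reward(r: Rollout, outcome_reward: float = 1.0) -> float:
    """Add the thinking reward only when the outcome is already correct.

    Assumption: failed rollouts receive 0.0 and the two rewards are summed;
    the paper's exact weighting may differ.
    """
    if not r.passed_tests:
        return 0.0  # reasoning of a failed attempt earns nothing, however well it scores
    return outcome_reward + r.thinking_score


def group_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Standardize rewards within one group of rollouts (GRPO-style)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(x - mu) / (sigma + eps) for x in rewards]


# Four rollouts for the same prompt: two pass, two fail.
group = [
    Rollout(passed_tests=True, thinking_score=0.9),
    Rollout(passed_tests=True, thinking_score=0.4),
    Rollout(passed_tests=False, thinking_score=0.95),  # polished reasoning, wrong code
    Rollout(passed_tests=False, thinking_score=0.1),
]
rewards = [posterior_reward(r) for r in group]
advantages = group_advantages(rewards)
```

In this sketch the third rollout has the highest thinking score but receives the same reward as the other failed attempt, so the policy gains nothing from reasoning that merely looks good to the reward model. Among the successful rollouts, the one with better-scored reasoning receives the larger advantage.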

Paper Structure

This paper contains 27 sections, 1 equation, 6 figures, 3 tables.

Figures (6)

  • Figure 1: An overview of the OD-based method.
  • Figure 2: An overview of P-GRPO, which adopts a posterior-based strategy: a thinking reward is incorporated into the total reward signal if, and only if, a rule-based reward first confirms that the final answer is correct.
  • Figure 3: Example of reasoning processes generated by the base model with P-GRPO and with GRPO.
  • Figure 4: Performance comparison of the model with P-GRPO against the GRPO baseline (a, b) and ablation studies on preference sources and reward models (c, d).
  • Figure 5: The prompt used for RL training.
  • ...and 1 more figure