Stop Summation: Min-Form Credit Assignment Is All Process Reward Model Needs for Reasoning

Jie Cheng, Gang Xiong, Ruixi Qiao, Lijun Li, Chao Guo, Junle Wang, Yisheng Lv, Fei-Yue Wang

TL;DR

This work tackles reward hacking in process reward models (PRMs) used for reinforcement fine-tuning of large language models. It introduces PURE (Process sUpervised Reinforcement lEarning), a min-form credit assignment that defines the value as the minimum of future process rewards, bounding the value range and reducing incentives to game high-reward steps. Empirically, PURE matches or exceeds verifiable-reward baselines using only about 30% of training steps, and combining PRMs with sparse verifiable rewards yields the best performance (e.g., 82.5% AMC23 accuracy and 53.3% average across five benchmarks with Qwen2.5-Math-7B). The paper also analyzes reward hacking types and training collapse, showing that while min-form mitigates many issues, a small amount of ground-truth supervision further stabilizes training; it releases code and weights to facilitate reproducibility and further research.

Abstract

Process reward models (PRMs) have proven effective for test-time scaling of Large Language Models (LLMs) on challenging reasoning tasks. However, reward hacking issues with PRMs limit their successful application in reinforcement fine-tuning. In this paper, we identify the main cause of PRM-induced reward hacking: the canonical summation-form credit assignment in reinforcement learning (RL), which defines the value as the cumulative gamma-decayed future rewards, easily induces LLMs to hack steps with high rewards. To address this, we propose PURE: Process sUpervised Reinforcement lEarning. The key innovation of PURE is a min-form credit assignment that formulates the value function as the minimum of future rewards. This method significantly alleviates reward hacking by limiting the range of the value function and distributing advantages more reasonably. Through extensive experiments on 3 base models, we show that PRM-based approaches with min-form credit assignment achieve reasoning performance comparable to verifiable reward-based methods within only 30% of the training steps. In contrast, the canonical sum-form credit assignment collapses training even at the beginning. Additionally, when we supplement PRM-based fine-tuning with just 10% verifiable rewards, we further alleviate reward hacking and produce the best fine-tuned model based on Qwen2.5-Math-7B in our experiments, achieving 82.5% accuracy on AMC23 and 53.3% average accuracy across 5 benchmarks. Moreover, we summarize the observed reward hacking cases and analyze the causes of training collapse. We release our code and model weights at https://github.com/CJReinforce/PURE.
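To make the contrast between the two credit assignment forms concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the return at step t covers the reward of step t and all later steps, and the function names and reward values are hypothetical.

```python
from typing import List


def sum_form_returns(rewards: List[float], gamma: float = 1.0) -> List[float]:
    """Sum-form: return at step t is the gamma-decayed sum of rewards from t onward."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def min_form_returns(rewards: List[float]) -> List[float]:
    """Min-form: return at step t is the minimum reward from t onward (assumed to include step t)."""
    returns = [0.0] * len(rewards)
    running = float("inf")
    for t in reversed(range(len(rewards))):
        running = min(running, rewards[t])
        returns[t] = running
    return returns


# Hypothetical per-step PRM scores; the third step is the incorrect one.
rewards = [1.0, 0.5, -0.5, 0.5]
sum_returns = sum_form_returns(rewards)  # [1.5, 0.5, 0.0, 0.5]
min_returns = min_form_returns(rewards)  # [-0.5, -0.5, -0.5, 0.5]
```

In this toy example the sum-form return of the first step stays positive despite the later mistake, and it would keep growing if more high-reward steps were added. The min-form return never leaves the range of the individual step rewards, which matches the bounded value range described in the abstract.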

Paper Structure

This paper contains 31 sections, 2 theorems, 15 equations, 8 figures, 5 tables.

Key Result

Theorem 1

Under Assumptions 1 and 2, for any state-action pair $(s_t, a_t)$ and a trajectory $\tau$ with $n$ reasoning steps:

Figures (8)

  • Figure 1: Comparison of summation-form and min-form credit assignment. In the table, Adv. and Process reward* denote the advantage and the transformed process reward, respectively. The incorrect steps in the rollout are highlighted in red, and our PRM reasonably assigns negative scores to these steps. For simplicity, the advantage baseline and KL penalty terms are omitted from the advantage calculation here, and the discount factor $\gamma$ and transform temperature $T$ are set to 1. Arrows indicate changes in sampling probability, with larger changes marked by contoured arrows.
  • Figure 2: Training curves for different variants of our methods on Qwen2.5-Math series. Curves of PURE-PRM (sum-form) and PURE-PRM+VR (sum-form) are truncated due to training collapse. Process-aggregated outcome reward is the summation of final process rewards for one response: for sum-form, it sums PRM-emitted rewards; for min-form, it sums transformed rewards, approximating the minimum PRM-emitted reward. Thus values across the 2 credit assignment methods are not comparable. For PURE-PRM, verifiable reward is logged but unused in training.
  • Figure 3: Training curves for PURE-PRM+VR with doubled process rewards based on Qwen2.5-7B. The correctness of responses is judged by the verifier. Process-aggregated outcome reward (bottom-left) is the summation of transformed process rewards for one response. No smoothing is applied. Training collapses at step 365, showing a sharp drop in rewards and accuracy.
  • Figure 4: Reward hacking, case 1: only thinking, not solving. In this example, the LLM analyzes the problem and gives a few equations for trigonometric simplifications, but does not substitute actual numbers to calculate and solve the problem. This is because the LLM hacks the implicit pattern inside high-reward steps, i.e., thinking.
  • Figure 5: Reward hacking, case 2: extremely few steps (1 step). In practice, we split a response into steps at double line breaks ("\n\n"), and the PRM then scores each step. When the advantage baseline is inappropriate, such as the step-level baseline discussed in the reward hacking section, the model learns to deliberately avoid outputting "\n\n" and prefers responses with few steps. In this example, the generated response contains no "\n\n", so the entire response is treated as a single step.
  • ...and 3 more figures
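
Two mechanics mentioned in the captions above can be sketched in a few lines: splitting a response into steps at double line breaks (Figure 5), and a reward transform whose sum approximates the minimum PRM-emitted reward (Figure 2). This is an illustrative sketch only. The softmin weighting used here is an assumption, since this excerpt does not give the paper's exact transform, and the function names are hypothetical.

```python
import math
from typing import List


def split_steps(response: str) -> List[str]:
    """Split a response into reasoning steps at double line breaks, dropping empty pieces."""
    return [step for step in response.split("\n\n") if step.strip()]


def softmin_transform(rewards: List[float], temperature: float = 0.1) -> List[float]:
    """Assumed transform: weight each reward by softmax(-r / T).

    The transformed rewards sum to a smooth approximation of min(rewards),
    which tightens as the temperature approaches 0.
    """
    shift = min(rewards)  # subtract the minimum for numerical stability
    weights = [math.exp(-(r - shift) / temperature) for r in rewards]
    total = sum(weights)
    return [w / total * r for w, r in zip(weights, rewards)]


# A response without any double line break collapses into a single step.
multi_step = split_steps("Set up the equation.\n\nSolve for x.\n\nCheck the answer.")
single_step = split_steps("Set up the equation. Solve for x. Check the answer.")

# Hypothetical PRM scores for three steps; the sum of the transformed rewards is close to -0.5.
transformed = softmin_transform([0.9, -0.5, 0.7], temperature=0.1)
approx_min = sum(transformed)
```

Under this assumed transform, almost all of the weight falls on the lowest-scoring step. An ordinary sum over the transformed rewards therefore behaves like a minimum, which is consistent with the Figure 2 description of the process-aggregated outcome reward for min-form.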

Theorems & Definitions (3)

  • Theorem 1: Q-Value Estimation Error Bound Comparison
  • Theorem 2: Q-Value Estimation Error Bound Comparison
  • proof : Proof