GRPO is Secretly a Process Reward Model

Michael Sullivan

TL;DR

The paper analyzes GRPO and shows that, under prefix-overlap assumptions, the standard GRPO objective implicitly induces a Monte Carlo–style PRM rather than relying on explicit step-level rewards. It formalizes the PRM induced by GRPO, proves equivalence to the PRM objective under certain conditions, and provides empirical evidence that rich, non-trivial process-step rewards arise frequently in practice. It then identifies a defect caused by non-uniform distribution of process steps and introduces λ-GRPO, a normalization that equalizes contribution across process sets, yielding faster convergence and better downstream reasoning across models and tasks. The work challenges the necessity of costly, explicitly-defined PRMs for GRPO by showing that the built-in PRM structure can be leveraged with minimal overhead to boost performance in multi-step reasoning benchmarks.

Abstract

We prove theoretically that the GRPO RL algorithm induces a non-trivial process reward model (PRM), under certain assumptions regarding within-group overlap of token sequences across completions. We then show empirically that these assumptions are met under real-world conditions: GRPO does in fact induce a non-trivial PRM. Leveraging the framework of GRPO-as-a-PRM, we identify a flaw in the GRPO objective: non-uniformly distributed process steps hinder both exploration and exploitation (under different conditions). We propose a simple modification to the algorithm to mitigate this defect ($\lambda$-GRPO), and show that LLMs trained with $\lambda$-GRPO achieve higher validation accuracy and performance on downstream reasoning tasks, and reach peak performance more rapidly, than LLMs trained with standard GRPO. Our results call into question the advantage of costly, explicitly-defined PRMs for GRPO: we show that it is possible to instead leverage the hidden, built-in PRM structure within the vanilla GRPO algorithm to boost model performance with a negligible impact on training time and cost.

Paper Structure

This paper contains 20 sections, 1 theorem, 12 equations, 6 figures, 2 tables.

Key Result

Theorem 1

For any query $q$, policy $\pi_\theta$, and group ${\mathbb{G}}\sim\pi_\theta(\cdot \mid q)$ with outcome-level rewards $\{r_i\}_{g^{(i)}\in{\mathbb{G}}}$: $L_\text{GRPO}({\mathbb{G}})=L_\text{PRM}({\mathbb{G}})$.
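
The outcome-level rewards $\{r_i\}$ in the theorem enter standard GRPO through a group-relative advantage. The sketch below shows that widely used standardization step only, not the paper's $L_\text{PRM}$ or its proof. The function name, the `eps` stabilizer, and the choice of population standard deviation are assumptions made for illustration; implementations differ on these details.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Standardize outcome-level rewards within a single group.

    Assumption: population standard deviation plus a small `eps` to
    avoid division by zero. Some implementations use the sample
    standard deviation or skip the scaling entirely.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four completions with binary outcome rewards.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# A group in which every completion gets the same reward carries no signal.
flat = group_relative_advantages([1.0, 1.0, 1.0])
```

Every token of completion $g^{(i)}$ receives the same advantage under this scheme, which is why the step-level structure described in the paper has to come from overlap between completions rather than from the reward function itself.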

Figures (6)

  • Figure 1: Toy example of a group ${\mathbb{G}}=\{g^{(1)},\dots,g^{(|{\mathbb{G}}|)}\}$ (left) and its corresponding ${\mathcal{B}}({\mathbb{G}})$ tree (right). Tokens (boxes) are numbered for readability; the subscripted numbers within boxes only indicate position. Each process set (node in the ${\mathcal{B}}({\mathbb{G}})$ tree) is a set of trajectories that share a common prefix, and corresponds to a process step (subtrajectory) spanning those shared tokens: in this figure, colored nodes in ${\mathcal{B}}({\mathbb{G}})$ correspond to those subsequences in ${\mathbb{G}}$ that span tokens/boxes of the same color. GRPO implicitly assigns a step-level reward and advantage to the tokens of each process step, both computed as functions of the mean outcome-level reward of the trajectories in the corresponding process set.
  • Figure 2: Validation reward (exact-match accuracy; left), ${\mathcal{B}}({\mathbb{G}})$ root-to-terminal path depth (center), and proportions of trajectories spanned by intermediate (non-terminal) process steps (right) for GRPO runs with group sizes of 36 (top) and 6 (bottom).
  • Figure 3: Models' validation accuracy across training steps. Peak accuracy is highlighted by vertical, dashed lines.
  • Figure 4: ${\mathcal{B}}({\mathbb{G}})$ structure from step 1 (see the beginning of the appendix on ${\mathcal{B}}({\mathbb{G}})$ examples for additional details).
  • Figure 5: ${\mathcal{B}}({\mathbb{G}})$ structure from step 1001 (see the beginning of the appendix on ${\mathcal{B}}({\mathbb{G}})$ examples for additional details).
  • ...and 1 more figure
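
The process sets described in the Figure 1 caption can be illustrated with a small prefix-grouping sketch. It assumes a toy group of three tokenized completions and uses hypothetical helper names (`process_sets`, `token_step_rewards`). It only computes the mean outcome-level reward over each set of trajectories sharing a prefix. How the paper turns that mean into a step-level reward and advantage, and how it defines the root of ${\mathcal{B}}({\mathbb{G}})$, is not specified here and is not reproduced.

```python
from collections import defaultdict


def process_sets(group):
    """Map every prefix to the indices of the trajectories that start with it."""
    by_prefix = defaultdict(set)
    for i, tokens in enumerate(group):
        for t in range(1, len(tokens) + 1):
            by_prefix[tuple(tokens[:t])].add(i)
    return by_prefix


def token_step_rewards(group, rewards):
    """For each token, average the outcome rewards over its process set."""
    by_prefix = process_sets(group)
    per_token = []
    for tokens in group:
        row = []
        for t in range(1, len(tokens) + 1):
            members = by_prefix[tuple(tokens[:t])]
            row.append(sum(rewards[j] for j in members) / len(members))
        per_token.append(row)
    return per_token


# Hypothetical toy group: all three completions share "a",
# and the first two also share "b".
group = [
    ["a", "b", "c"],
    ["a", "b", "d"],
    ["a", "e", "f"],
]
rewards = [1.0, 0.0, 0.0]

step_rewards = token_step_rewards(group, rewards)

# Distinct sets of trajectories, i.e. candidate nodes of the tree.
nodes = {frozenset(members) for members in process_sets(group).values()}
```

In this toy group the shared token `"a"` is scored by the mean reward of all three completions, `"b"` by the mean of the first two, and the unshared suffix tokens by the single completion's own reward. Consecutive tokens with the same set of trajectories collapse into one node, which matches the caption's description of a process step spanning the shared tokens.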

Theorems & Definitions (2)

  • Theorem 1
  • proof