GRPO is Secretly a Process Reward Model

Michael Sullivan

TL;DR

The paper analyzes Group Relative Policy Optimization (GRPO) and shows that, under assumptions about prefix overlap among the completions in a group, the standard GRPO objective implicitly induces a Monte Carlo–style process reward model (PRM) rather than relying on explicit step-level rewards. It formalizes the PRM induced by GRPO, proves equivalence to the PRM objective under certain conditions, and provides empirical evidence that non-trivial process-step rewards arise frequently in practice. It then identifies a defect caused by the non-uniform distribution of process steps and introduces λ-GRPO, a normalization that equalizes contribution across process sets, yielding faster convergence and better downstream reasoning across models and tasks. The work questions the necessity of costly, explicitly defined PRMs for GRPO by showing that the built-in PRM structure can be exploited with negligible overhead to improve performance on multi-step reasoning benchmarks.

Abstract

We prove theoretically that the GRPO RL algorithm induces a non-trivial process reward model (PRM), under certain assumptions regarding within-group overlap of token sequences across completions. We then show empirically that these assumptions are met under real-world conditions: GRPO does in fact induce a non-trivial PRM. Leveraging the framework of GRPO-as-a-PRM, we identify a flaw in the GRPO objective: non-uniformly distributed process steps hinder both exploration and exploitation (under different conditions). We propose a simple modification to the algorithm to mitigate this defect ($λ$-GRPO), and show that LLMs trained with $λ$-GRPO achieve higher validation accuracy and performance on downstream reasoning tasks, and reach peak performance more rapidly, than LLMs trained with standard GRPO. Our results call into question the advantage of costly, explicitly defined PRMs for GRPO: we show that it is possible to instead leverage the hidden, built-in PRM structure within the vanilla GRPO algorithm to boost model performance with a negligible impact on training time and cost.
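To make the within-group overlap assumption more concrete, the following is a minimal illustrative sketch, not the paper's construction. It computes group-relative advantages in the commonly stated GRPO form (reward minus group mean, divided by group standard deviation) and lists the token prefixes shared by two or more completions in a group. The function names, the toy completions, the rewards, and the use of the population standard deviation are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-8):
    """Group-relative advantages: (r - group mean) / (group std + eps).

    Assumption: population standard deviation; implementations may differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def shared_prefixes(completions):
    """Map each token prefix shared by >= 2 completions to their indices."""
    groups = defaultdict(set)
    for i, tokens in enumerate(completions):
        for k in range(1, len(tokens) + 1):
            groups[tuple(tokens[:k])].add(i)
    return {p: sorted(ix) for p, ix in groups.items() if len(ix) >= 2}


# Hypothetical group of four tokenized completions for one prompt.
completions = [
    ["x", "=", "2", ";", "y", "=", "4"],
    ["x", "=", "2", ";", "y", "=", "5"],
    ["x", "=", "3", ";", "y", "=", "6"],
    ["z", "=", "1"],
]
rewards = [1.0, 0.0, 0.0, 0.0]  # hypothetical outcome-level rewards

advantages = grpo_advantages(rewards)
overlap = shared_prefixes(completions)

for prefix, members in sorted(overlap.items(), key=lambda kv: len(kv[0])):
    print(" ".join(prefix), "->", members)
```

In this toy group, the prefix `x =` is shared by three completions with different outcome rewards, while the fourth completion shares no prefix with the others. Overlap of this kind is what the abstract refers to as within-group overlap of token sequences.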
