Do We Need to Verify Step by Step? Rethinking Process Supervision from a Theoretical Perspective

Zeyu Jia, Alexander Rakhlin, Tengyang Xie

TL;DR

This paper analyzes the statistical relationship between process supervision (step-level feedback) and outcome supervision (trajectory-level feedback) in reinforcement learning for complex reasoning. It introduces the Change of Trajectory Measure Lemma and shows that, under standard state-action coverage, outcome supervision is not statistically harder than process supervision up to polynomial factors in the horizon $H$, enabling data transforms and algorithmic transfer from outcome data to process-based methods. It further connects outcome supervision to an optimal process reward model via the policy advantage function, extends the theory to preference-based RL (including DPO), and provides both upper and lower bounds distinguishing advantages of using the advantage function over Q-functions. The results suggest that empirical gaps between supervision paradigms may be driven mainly by algorithmic limitations rather than intrinsic statistical difficulty, with potential practical impact on data collection and RLHF design.

Abstract

As large language models have evolved, it has become crucial to distinguish between process supervision and outcome supervision -- two key reinforcement learning approaches to complex reasoning tasks. While process supervision offers intuitive advantages for long-term credit assignment, the precise relationship between these paradigms has remained an open question. Conventional wisdom suggests that outcome supervision is fundamentally more challenging due to the trajectory-level coverage problem, leading to significant investment in collecting fine-grained process supervision data. In this paper, we take steps towards resolving this debate. Our main theorem shows that, under standard data coverage assumptions, reinforcement learning through outcome supervision is no more statistically difficult than through process supervision, up to polynomial factors in horizon. At the core of this result lies the novel Change of Trajectory Measure Lemma -- a technical tool that bridges return-based trajectory measure and step-level distribution shift. Furthermore, for settings with access to a verifier or a rollout capability, we prove that any policy's advantage function can serve as an optimal process reward model, providing a direct connection between outcome and process supervision. These findings suggest that the empirically observed performance gap -- if any -- between outcome and process supervision likely stems from algorithmic limitations rather than inherent statistical difficulties, potentially transforming how we approach data collection and algorithm design for reinforcement learning.
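The link between outcome rewards and advantage-based process rewards can be checked numerically on a toy problem. The sketch below is not the paper's construction or proof. It illustrates the standard performance difference lemma, which says that the expected sum of a reference policy's advantages along another policy's trajectories equals the gap between the two policies' returns. The tabular MDP, its sizes, and every name here (`evaluate`, `advantage_reward`, `mu`, and so on) are assumptions made for illustration.

```python
import itertools
import random

H, S, A = 3, 2, 2  # horizon, number of states, number of actions (toy sizes)
rng = random.Random(0)


def random_dist(n):
    """Return a random probability vector of length n."""
    w = [rng.random() for _ in range(n)]
    total = sum(w)
    return [x / total for x in w]


# P[h][s][a] is a distribution over next states for steps 0..H-2.
P = [[[random_dist(S) for _ in range(A)] for _ in range(S)] for _ in range(H - 1)]
# outcome[s][a] is paid only at the last step (outcome supervision).
outcome = [[rng.random() for _ in range(A)] for _ in range(S)]
init = random_dist(S)


def outcome_reward(h, s, a):
    return outcome[s][a] if h == H - 1 else 0.0


def evaluate(policy, reward):
    """Backward induction for a deterministic policy[h][s]; returns (Q, V)."""
    V = [[0.0] * S for _ in range(H + 1)]
    Q = [[[0.0] * A for _ in range(S)] for _ in range(H)]
    for h in reversed(range(H)):
        for s in range(S):
            for a in range(A):
                nxt = 0.0
                if h < H - 1:
                    nxt = sum(P[h][s][a][s2] * V[h + 1][s2] for s2 in range(S))
                Q[h][s][a] = reward(h, s, a) + nxt
            V[h][s] = Q[h][s][policy[h][s]]
    return Q, V


def value(policy, reward):
    """Expected return of the policy from the initial distribution."""
    _, V = evaluate(policy, reward)
    return sum(init[s] * V[0][s] for s in range(S))


# Enumerate every deterministic, step-dependent policy (A ** (H * S) of them).
policies = [
    [list(flat[h * S:(h + 1) * S]) for h in range(H)]
    for flat in itertools.product(range(A), repeat=H * S)
]

mu = policies[0]  # arbitrary reference policy (an assumption of this sketch)
Q_mu, V_mu = evaluate(mu, outcome_reward)


def advantage_reward(h, s, a):
    """Step-level reward defined as the advantage of the reference policy."""
    return Q_mu[h][s][a] - V_mu[h][s]


# Performance difference lemma: J_adv(pi) = J_outcome(pi) - J_outcome(mu).
max_gap = max(
    abs(value(pi, advantage_reward)
        - (value(pi, outcome_reward) - value(mu, outcome_reward)))
    for pi in policies
)

best_outcome = max(policies, key=lambda pi: value(pi, outcome_reward))
best_process = max(policies, key=lambda pi: value(pi, advantage_reward))
```

Because the two objectives differ only by the constant return of `mu`, a policy that maximizes the advantage-based step rewards also maximizes the outcome reward in this toy setting, whichever reference policy is chosen.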

Paper Structure

This paper contains 45 sections, 12 theorems, 113 equations, 2 algorithms.

Key Result

Theorem 1

Suppose the dataset $\mathcal{D}_O$ is collected i.i.d. according to policy $\pi_{\mathrm{off}}$ in the MDP $M = (\mathcal{S}, \mathcal{A}, P, r^\star, H)$ with the ground truth reward model $r^\star\in \mathcal{R}$. Then, with probability at least $1 - \delta$, for any policy $\pi$, the PRM reward where $C_{\mathsf{sa}}(\pi, \pi_{\mathrm{off}})$ is the state-action concentrability coefficient de

Theorems & Definitions (24)

  • Theorem 1
  • Corollary 2
  • Lemma 3
  • Theorem 4
  • Theorem 5
  • Theorem 6
  • Theorem 7
  • Proof of \ref{lem:orm-reward}
  • Proof of \ref{thm: orm-prm-transformation}
  • Proof of \ref{corr: alg-transform}
  • ...and 14 more