
LLM Reasoning with Process Rewards for Outcome-Guided Steps

Mohammad Rezaei, Jens Lehmann, Sahar Vahdati

Abstract

Mathematical reasoning in large language models has improved substantially with reinforcement learning using verifiable rewards, where final answers can be checked automatically and converted into reliable training signals. Most such pipelines optimize outcome correctness only, which yields sparse feedback for long, multi-step solutions and offers limited guidance on intermediate reasoning errors. Recent work therefore introduces process reward models (PRMs) to score intermediate steps and provide denser supervision. In practice, PRM scores are often imperfectly aligned with final correctness and can reward locally fluent reasoning that still ends in an incorrect answer. When optimized as absolute rewards, such signals can amplify fluent failure modes and induce reward hacking. We propose PROGRS, a framework that leverages PRMs while keeping outcome correctness dominant. PROGRS treats process rewards as relative preferences within outcome groups rather than absolute targets. We introduce outcome-conditioned centering, which shifts the PRM scores of incorrect trajectories to zero mean within each prompt group, removing systematic bias while preserving informative rankings. PROGRS combines a frozen quantile-regression PRM with a multi-scale coherence evaluator. We integrate the resulting centered process bonus into Group Relative Policy Optimization (GRPO) without auxiliary objectives or additional trainable components. Across MATH-500, AMC, AIME, MinervaMath, and OlympiadBench, PROGRS consistently improves Pass@1 over outcome-only baselines and achieves stronger performance with fewer rollouts. These results show that outcome-conditioned centering enables safe and effective use of process rewards for mathematical reasoning.
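
To make the centering step concrete, below is a minimal sketch of how outcome-conditioned centering could be combined with group-relative advantages. The abstract does not give implementation details, so the function names, the process-bonus weight `beta`, and the aggregation of step-level PRM scores into a single trajectory score are illustrative assumptions; only the zero-mean shift of incorrect trajectories within a prompt group and the group-relative normalization follow the description above.

```python
import numpy as np

def centered_process_bonus(prm_scores, correct_mask):
    """Outcome-conditioned centering (sketch): within one prompt group,
    shift the PRM scores of incorrect rollouts to zero mean so they act
    as relative preferences rather than absolute reward targets."""
    scores = np.asarray(prm_scores, dtype=float)
    wrong = ~np.asarray(correct_mask, dtype=bool)
    bonus = scores.copy()
    if wrong.any():
        # Remove the systematic bias of fluent-but-wrong trajectories
        # while keeping their within-group ranking information.
        bonus[wrong] -= scores[wrong].mean()
    return bonus

def grpo_advantages(outcome_rewards, process_bonus, beta=0.1, eps=1e-8):
    """Group-relative advantages: the verifiable outcome reward stays
    dominant; the centered process bonus enters with a small assumed
    weight beta before the usual group-wise normalization."""
    total = (np.asarray(outcome_rewards, dtype=float)
             + beta * np.asarray(process_bonus, dtype=float))
    return (total - total.mean()) / (total.std() + eps)

# Toy group of 4 rollouts for one prompt: two correct, two incorrect.
prm = [0.9, 0.7, 0.8, 0.4]   # trajectory-level PRM scores (assumed aggregation)
correct = [1, 1, 0, 0]       # verifiable final-answer correctness
print(grpo_advantages(correct, centered_process_bonus(prm, correct)))
```

In this sketch, only the incorrect trajectories are shifted, so a fluent but wrong rollout can no longer accumulate a large absolute process reward, while its ranking relative to the other incorrect rollouts in the same group is preserved.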

Paper Structure

This paper contains 22 sections, 13 equations, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Computational efficiency across models. (a) Pareto frontier: Accuracy vs. budget; PROGRS-4 matches DAPO baselines with $\sim50\%$ budget, PROGRS-8 matches/exceeds DAPO-16. (b) Efficiency ranking: Accuracy per token; PROGRS-8 highest, PROGRS-4 next. (c) Efficiency score: Accuracy normalized by budget; PROGRS-4 best. (d) Efficiency-budget surface: PROGRS-4 optimal for $\le 50\%$ budget, PROGRS-8 for $50\%-100\%$.
  • Figure 2: Training dynamics of DAPO vs. PROGRS: mean and std. of advantages and per-token entropy over training (smoothed with a 50-step moving average).
  • Figure 3: Pass@K accuracy across benchmarks. Each panel shows Pass@1, Pass@5, and Pass@10 for DAPO-16 (red), DAPO-8 (orange), PROGRS-4 (green), and PROGRS-8 (blue). Shaded regions indicate standard deviation across evaluation runs.