WS-GRPO: Weakly-Supervised Group-Relative Policy Optimization for Rollout-Efficient Reasoning

Gagan Mundada, Zihan Huang, Rohan Surana, Sheldon Yu, Jennifer Yuntong Zhang, Xintong Li, Tong Yu, Lina Yao, Jingbo Shang, Julian McAuley, Junda Wu

TL;DR

WS-GRPO tackles the rollout efficiency problem in group-relative policy optimization by turning sparse final-answer correctness into dense, prefix-level guidance. It introduces a two-phase weakly supervised framework: Phase I learns a trajectory-quality preference from outcome labels, and Phase II uses this preference to generate prefix-level pseudo-rewards that guide policy optimization within the GRPO objective. The approach provides theoretical guarantees (consistency, robustness to preference errors, generalization) and demonstrates substantial reductions in rollout length with competitive accuracy across diverse reasoning benchmarks. This results in more concise, reliable reasoning while maintaining performance, offering a practical path to efficient multi-step reasoning in large language models.
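
As a rough illustration of Phase I, the sketch below builds preference pairs from rollouts labeled only by final-answer correctness. The `Rollout` type, the `build_preference_pairs` helper, and the choice to pair every correct rollout with every incorrect rollout of the same prompt are assumptions made for illustration; the paper's actual pair construction may differ.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Rollout:
    prompt_id: str
    steps: tuple[str, ...]   # reasoning steps of the sampled trajectory
    correct: bool            # outcome-only label: is the final answer right?


def build_preference_pairs(rollouts):
    """Return (preferred, rejected) pairs: correct vs. incorrect rollouts of the same prompt."""
    # Group rollouts by prompt so that only comparable trajectories are paired.
    by_prompt = {}
    for r in rollouts:
        by_prompt.setdefault(r.prompt_id, []).append(r)

    pairs = []
    for group in by_prompt.values():
        winners = [r for r in group if r.correct]
        losers = [r for r in group if not r.correct]
        # Hypothetical pairing rule: every correct rollout beats every incorrect one.
        pairs.extend(product(winners, losers))
    return pairs
```

Prompts whose rollouts are all correct or all incorrect contribute no pairs under this rule, since outcome labels alone give no ordering within such a group.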

Abstract

Group Relative Policy Optimization (GRPO) is effective for training language models on complex reasoning. However, since the objective is defined relative to a group of sampled trajectories, extended deliberation can create more chances to realize relative gains, leading to inefficient reasoning and overthinking, and complicating the trade-off between correctness and rollout efficiency. Controlling this behavior is difficult in practice for two reasons: (i) length penalties are hard to calibrate, because longer rollouts may reflect harder problems that require longer reasoning, so penalizing tokens risks truncating useful reasoning along with redundant continuation; and (ii) supervision that directly indicates when to continue or stop is typically unavailable beyond final-answer correctness. We propose Weakly Supervised GRPO (WS-GRPO), which improves rollout efficiency by converting terminal rewards into correctness-aware guidance over partial trajectories. Unlike global length penalties, WS-GRPO trains a preference model from outcome-only correctness to produce prefix-level signals that indicate when additional continuation is beneficial. Thus, WS-GRPO supplies outcome-derived continue/stop guidance, reducing redundant deliberation while maintaining accuracy. We provide theoretical results and empirically show on reasoning benchmarks that WS-GRPO substantially reduces rollout length while remaining competitive with GRPO baselines.
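
One way to picture the prefix-level signal is sketched below: a frozen preference scorer compares each prefix with the one before it, and the scaled difference serves as a dense reward that is added to the terminal correctness reward. The `score_prefix` callable, the difference-of-scores form, and the weight `beta` are hypothetical; the source only states that consecutive prefixes are compared and that the resulting rewards are combined with the terminal reward.

```python
from typing import Callable, Sequence


def prefix_rewards(
    steps: Sequence[str],
    score_prefix: Callable[[Sequence[str]], float],
    terminal_reward: float,
    beta: float = 0.5,
) -> list[float]:
    """Per-step rewards: scaled gain of each prefix over the previous one,
    with the terminal (final-answer) reward added to the last step."""
    rewards = []
    prev = score_prefix(steps[:0])           # score of the empty prefix
    for t in range(1, len(steps) + 1):
        cur = score_prefix(steps[:t])
        rewards.append(beta * (cur - prev))  # positive if step t improved the prefix
        prev = cur
    if rewards:
        rewards[-1] += terminal_reward       # sparse outcome reward at the end
    return rewards
```

With a toy scorer that stops improving after two steps, any later steps earn no dense reward, which is the kind of signal that would discourage redundant continuation without penalizing length directly.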

Paper Structure

This paper contains 27 sections, 6 theorems, 65 equations, 6 figures, 3 tables, 2 algorithms.

Key Result

Theorem 3.1

Let $P_{\omega^*}$ be the optimal preference model trained with complete step-level annotations, and let $P_{\hat{\omega}_n}$ be our weakly-supervised preference model trained on $n$ trajectory pairs with only outcome-level supervision. Under regularity conditions, the preference model error satisfies: with probability at least $1-\delta$, where $d_P$ is the VC-dimension of the preference model class.

Figures (6)

  • Figure 1: WS-GRPO for efficient reasoning. GRPO's group-relative objective can favor longer trajectories, while simple length penalties are hard to calibrate and human supervision is expensive (top). WS-GRPO uses a preference model trained on correct vs. incorrect outcomes to compare consecutive prefixes, generating correctness-aware guidance that reduces redundant continuation while preserving necessary reasoning (bottom). Some icons were generated by a generative AI tool (ChatGPT) and are for illustrative purposes only.
  • Figure 2: WS-GRPO framework overview. Phase 1: using outcome-only correctness (final-answer labels), we construct preference pairs from correct vs. incorrect rollouts and train a preference model $P_\omega$. Phase 2: the frozen $P_\omega$ converts terminal correctness into correctness-aware, prefix-level rewards by comparing consecutive prefixes within each rollout. These dense prefix-level rewards are combined with the terminal reward and used in the GRPO objective to refine the policy, improving rollout efficiency.
  • Figure 3: Left: trajectory length distribution of the GSM8K samples used in the analysis. Middle: absolute score difference over steps. Right: average combined reward score as a function of trajectory length.
  • Figure 4: Validation step-efficiency (Pass@1 / average reasoning steps) across training for Qwen models. Higher values indicate greater accuracy per reasoning step. WS-GRPO consistently achieves higher step-efficiency.
  • Figure 5: Validation step-efficiency (Pass@1 / average reasoning steps) across training for Llama models. Higher values indicate greater accuracy per reasoning step.
  • ...and 1 more figure
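
The step-efficiency metric in Figures 4 and 5 is described as Pass@1 divided by average reasoning steps. A minimal sketch of that ratio follows; the function name, the input layout, and the assumption of one sampled rollout per problem (so that Pass@1 is the fraction of correct rollouts) are illustrative choices, not taken from the paper.

```python
def step_efficiency(correct_flags, step_counts):
    """Pass@1 divided by the average number of reasoning steps per rollout."""
    if not correct_flags or len(correct_flags) != len(step_counts):
        raise ValueError("need one correctness flag and one step count per example")
    # Assumes one rollout per problem, so Pass@1 is the fraction answered correctly.
    pass_at_1 = sum(correct_flags) / len(correct_flags)
    avg_steps = sum(step_counts) / len(step_counts)
    return pass_at_1 / avg_steps
```

For example, three correct answers out of four with an average of five steps gives 0.75 / 5 = 0.15; the same accuracy at an average of three steps gives 0.25, so shorter rollouts raise the metric only if accuracy holds.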

Theorems & Definitions (6)

  • Theorem 3.1: Preference Model Consistency
  • Theorem 3.2: Policy Robustness to Preference Errors
  • Theorem 3.3: WS-GRPO Generalization Bound
  • Theorem 1.1: Preference Model Consistency
  • Theorem 1.2: Policy Robustness to Preference Errors
  • Theorem 1.3: WS-GRPO Generalization Bound