Sharpness-Controlled Group Relative Policy Optimization with Token-Level Probability Shaping

Tue Le, Nghi D. Q. Bui, Linh Ngo Van, Trung Le

TL;DR

The paper identifies a generalization problem in GRPO: rare, low-probability tokens can dominate gradient updates and harm the generalization of LLM reasoning. It introduces Token-Regulated GRPO (TR-GRPO), which applies monotone, probability-aware weights to token updates, reducing gradient sharpness while preserving signal from semantically critical tokens. The authors provide theoretical bounds linking token probabilities to gradient magnitudes and report that TR-GRPO learns more accurately and more stably than GRPO on logic puzzles, mathematical reasoning, and agentic tool-augmented QA, with smoother gradient-norm trajectories. They present TR-GRPO as a simple, generalization-oriented upgrade to GRPO for RLVR, discuss its practical overhead, and evaluate it across several domains and models.

Abstract

Reinforcement learning with verifiable rewards (RLVR) has become a practical route to improve large language model reasoning, and Group Relative Policy Optimization (GRPO) is a widely used optimizer in this setting. This paper revisits GRPO from a generalization perspective. Recent analysis shows that population performance can be controlled by a robust empirical objective that decomposes into the training loss plus a sharpness term measured by the gradient norm. We develop a token-level view of this sharpness term and show that GRPO can be dominated by a small subset of tokens with disproportionately large per-token gradients, which increases sharpness and can harm generalization. Motivated by this view, we propose Token-Regulated GRPO (TR-GRPO), which introduces a monotone probability shaping function to assign token weights based on the model's own token probabilities, and integrates these weights into the standard GRPO objective. Our analysis yields a bound that isolates a probability-dependent multiplicative factor in token-gradient magnitudes, explaining how probability-aware weighting suppresses sharp directions while preserving learning signal on semantically critical tokens. Experiments on logic puzzles, mathematical reasoning, and tool-augmented question answering show consistent improvements over GRPO, along with smoother gradient-norm trajectories, supporting TR-GRPO as a simple and effective generalization-oriented upgrade to GRPO for RLVR.
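
The weighting idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the linear shaping function `token_weight`, its bounds `w_min` and `w_max`, and the clipping range `eps` are all assumptions chosen for illustration, since this page does not state the paper's shaping function. The sketch only shows where a monotone, probability-based weight would enter a GRPO-style clipped surrogate.

```python
import math

def group_relative_advantages(rewards):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + 1e-8) for r in rewards]

def token_weight(prob, w_min=0.5, w_max=1.5):
    """Hypothetical monotone shaping: the weight grows linearly with the
    token's probability, so low-probability tokens are down-weighted."""
    return w_min + (w_max - w_min) * prob

def weighted_clipped_objective(new_logps, old_logps, advantage, eps=0.2):
    """Token-weighted clipped surrogate for a single response.

    In a real autodiff setup the weight would be treated as a constant
    (no gradient flows through it); plain Python has no gradients, so
    this function only evaluates the objective value.
    """
    total = 0.0
    for new_lp, old_lp in zip(new_logps, old_logps):
        ratio = math.exp(new_lp - old_lp)
        clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
        surrogate = min(ratio * advantage, clipped * advantage)
        total += token_weight(math.exp(new_lp)) * surrogate
    return total / len(new_logps)

# One group of four responses with binary verifiable rewards.
advantages = group_relative_advantages([1.0, 0.0, 1.0, 0.0])
logps = [math.log(0.9), math.log(0.05), math.log(0.6)]
objective = weighted_clipped_objective(logps, logps, advantages[0])
```

With an increasing weight, the token with probability 0.05 contributes less to the objective than the token with probability 0.9. Swapping in a decreasing weight would correspond to the Reverse variant mentioned in the Figure 2 caption, which emphasizes low-probability tokens.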

Paper Structure

This paper contains 33 sections, 3 theorems, 38 equations, 4 figures, 9 tables.

Key Result

Theorem 3.1

With probability at least $1-\delta$ over $\mathcal{S}\sim \mathcal{D}^N$,

Figures (4)

  • Figure 1: Word clouds of the top 100 high- vs. low-probability tokens selected from frequently occurring words. High-probability tokens (left) primarily consist of mathematical and logical operators, brackets, and variable names, where even small errors can invalidate an entire solution, whereas low-probability tokens (right) mostly consist of generic content words that are less critical.
  • Figure 2: Accuracy on the K&K Logic Puzzles benchmark, broken down by puzzle size (3 to 7 people). TR-GRPO consistently achieves higher accuracy than GRPO across all difficulty levels, while the Reverse variant that emphasizes low-probability tokens yields performance comparable to GRPO without clear improvement.
  • Figure 3: Gradient norm trajectories during training under GRPO vs. TR-GRPO across three RLVR settings. TR-GRPO consistently exhibits lower variability and fewer spikes than GRPO, in line with the reduced sharpness predicted by the paper's sharpness approximation and its gradient-norm bound.
  • Figure 4: Reward trajectories during training under GRPO vs. TR-GRPO across three RLVR settings. TR-GRPO achieves higher reward while also exhibiting lower sharpness, as reflected by gradient norms.

Theorems & Definitions (4)

  • Theorem 3.1: Generalization via local robustness
  • Theorem 3.2: Token gradient bound for TR-GRPO
  • Lemma D.1
  • proof
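
To give some intuition for why token probability appears in a token-gradient bound, the sketch below checks a standard softmax fact; it is not the paper's Theorem 3.2. It considers the gradient with respect to the logits only (an assumption made for illustration), where the gradient of log π(y) equals onehot(y) − π, so its L2 norm lies between (1 − π(y)) and √2 (1 − π(y)). The logit values are made up.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def logit_grad_norm(logits, token):
    """Return (L2 norm of d log pi(token) / d logits, pi(token)).

    The gradient with respect to the logits is onehot(token) - pi.
    """
    probs = softmax(logits)
    grad = [(1.0 if j == token else 0.0) - p for j, p in enumerate(probs)]
    return math.sqrt(sum(g * g for g in grad)), probs[token]

# Made-up logits: token 0 is likely, token 3 is rare.
logits = [4.0, 1.0, 0.0, -1.0]
for token in range(len(logits)):
    norm, p = logit_grad_norm(logits, token)
    # The norm is sandwiched by multiples of (1 - p).
    assert (1.0 - p) - 1e-12 <= norm <= math.sqrt(2.0) * (1.0 - p) + 1e-12
    print(f"token={token}  prob={p:.4f}  grad_norm={norm:.4f}")
```

The rarer the sampled token, the closer (1 − π(y)) is to 1 and the larger this logit-level gradient, which matches the qualitative claim that low-probability tokens can dominate updates unless they are down-weighted.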