Sharpness-Controlled Group Relative Policy Optimization with Token-Level Probability Shaping
Tue Le, Nghi D. Q. Bui, Linh Ngo Van, Trung Le
TL;DR
The paper identifies a generalization problem in GRPO: rare, low-probability tokens can dominate gradient updates, which harms how well LLM reasoning generalizes. It introduces Token-Regulated GRPO (TR-GRPO), which applies monotone, probability-aware weights to token updates, reducing gradient sharpness while preserving signal from semantically critical tokens. The authors provide theoretical bounds linking token probabilities to gradient magnitudes and report that TR-GRPO yields stronger, more stable learning across logic puzzles, mathematical reasoning, and agentic tool-augmented QA, with higher accuracy and smoother gradient-norm trajectories than GRPO. TR-GRPO is presented as a simple, generalization-oriented upgrade to GRPO for RLVR that applies across domains and models, with its computational overhead also discussed.
Abstract
Reinforcement learning with verifiable rewards (RLVR) has become a practical route to improve large language model reasoning, and Group Relative Policy Optimization (GRPO) is a widely used optimizer in this setting. This paper revisits GRPO from a generalization perspective. Recent analysis shows that population performance can be controlled by a robust empirical objective that decomposes into the training loss plus a sharpness term measured by the gradient norm. We develop a token-level view of this sharpness term and show that GRPO can be dominated by a small subset of tokens with disproportionately large per-token gradients, which increases sharpness and can harm generalization. Motivated by this view, we propose Token-Regulated GRPO (TR-GRPO), which introduces a monotone probability shaping function to assign token weights based on the model's own token probabilities, and integrates these weights into the standard GRPO objective. Our analysis yields a bound that isolates a probability-dependent multiplicative factor in token-gradient magnitudes, explaining how probability-aware weighting suppresses sharp directions while preserving learning signal on semantically critical tokens. Experiments on logic puzzles, mathematical reasoning, and tool-augmented question answering show consistent improvements over GRPO, along with smoother gradient-norm trajectories, supporting TR-GRPO as a simple and effective generalization-oriented upgrade to GRPO for RLVR.
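
The idea of probability-aware token weighting can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the shaping function, so `shape_weight` and its parameters `w_min` and `gamma` are hypothetical, as is the assumption that weights are non-decreasing in token probability and treated as constants during differentiation. The helper `logit_grad_norm` uses only the standard softmax identity that the gradient of log pi[k] with respect to the logits is e_k - pi, whose norm lies between (1 - p) and sqrt(2) * (1 - p) for token probability p. This is one simple way to see why low-probability tokens carry larger gradients, and it may differ from the bound derived in the paper.

```python
import math


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within one sampled group."""
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    return [(r - mean) / (math.sqrt(var) + eps) for r in rewards]


def shape_weight(p, w_min=0.2, gamma=1.0):
    """Hypothetical monotone shaping function (an assumption, not from the paper).

    Non-decreasing in the token probability p, with values in [w_min, 1].
    """
    return w_min + (1.0 - w_min) * (p ** gamma)


def logit_grad_norm(probs, k):
    """L2 norm of d log(probs[k]) / d logits for a softmax policy (= ||e_k - probs||)."""
    return math.sqrt(
        sum(((1.0 if j == k else 0.0) - q) ** 2 for j, q in enumerate(probs))
    )


def weighted_token_objective(token_probs, old_token_probs, advantage,
                             clip_eps=0.2, weight_fn=shape_weight):
    """Token-mean of weight * clipped surrogate for one sampled response.

    token_probs / old_token_probs: probabilities of the sampled tokens under
    the current and the behavior policy. Weights are treated as constants.
    """
    total = 0.0
    for p, p_old in zip(token_probs, old_token_probs):
        ratio = p / p_old
        clipped = min(max(ratio, 1.0 - clip_eps), 1.0 + clip_eps)
        surrogate = min(ratio * advantage, clipped * advantage)
        total += weight_fn(p) * surrogate
    return total / len(token_probs)
```

Passing `weight_fn=lambda p: 1.0` recovers the unweighted clipped surrogate in this sketch, which makes it easy to compare how much a low-probability token contributes with and without shaping.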
