$\lambda$-GRPO: Unifying the GRPO Frameworks with Learnable Token Preferences

Yining Wang, Jinman Zhao, Chuangxin Zhao, Shuhao Guan, Gerald Penn, Shinan Liu

TL;DR

$\lambda$-GRPO introduces a learnable, context-aware token-preference mechanism for RLVR-based policy optimization, unifying GRPO, DAPO, and Dr. GRPO under a single framework. It computes a standardized length deviation $z_i$ for each response, lifts it via $h_i = 1 + r z_i$, and raises the result to a learnable exponent $\lambda$ to form $g_i = h_i^{\lambda}$; the $g_i$ are then softmax-normalized to produce token-group weights, enabling adaptive weighting of token-level losses. The sign of $\lambda$ biases the optimization toward longer ($\lambda > 0$) or shorter ($\lambda < 0$) outputs, $r$ scales how strongly length deviations are amplified, and the softmax normalization keeps the weights stable; the objective remains a clipped surrogate applied at the token level. Empirically, $\lambda$-GRPO yields consistent improvements on eight math-reasoning benchmarks across Qwen2.5 sizes (1.5B, 3B, 7B), with higher entropy and stable training, and without additional data or compute. This approach offers a practical, interpretable method to balance verbosity and factual correctness in verifiable-reward RL training.
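The weighting pipeline described above ($z_i \to h_i \to g_i \to$ softmax) can be sketched in a few lines of plain Python. This is an illustrative sketch, not the authors' implementation: the function name `length_weights`, the default `r=0.3`, the `eps` term, the use of the population standard deviation, and the positivity guard on $h_i$ are all assumptions made for the example, and in actual training $\lambda$ would be a learnable parameter rather than a fixed float.

```python
import math


def length_weights(lengths, lam, r=0.3, eps=1e-8):
    """Softmax-normalized weights for a group of responses, from their lengths.

    Hypothetical helper: names, defaults, and eps are assumptions of this sketch.
    """
    n = len(lengths)
    mean = sum(lengths) / n
    std = math.sqrt(sum((x - mean) ** 2 for x in lengths) / n)

    # z_i: standardized length deviation within the group.
    z = [(x - mean) / (std + eps) for x in lengths]

    # h_i = 1 + r * z_i: an average-length response maps to h = 1.
    h = [1.0 + r * zi for zi in z]
    if min(h) <= 0:
        # Guard added for the sketch: h_i ** lam is ill-defined for h_i <= 0.
        raise ValueError("r is too large for this group: h_i must stay positive")

    # g_i = h_i ** lam: lam sets the direction and strength of the preference.
    g = [hi ** lam for hi in h]

    # Numerically stable softmax over the group.
    m = max(g)
    e = [math.exp(gi - m) for gi in g]
    total = sum(e)
    return [ei / total for ei in e]


lengths = [120, 200, 350, 500]
uniform = length_weights(lengths, lam=0.0)    # equal weights
longer = length_weights(lengths, lam=1.0)     # weights grow with length
shorter = length_weights(lengths, lam=-1.0)   # weights shrink with length
```

With `lam=0.0` every $g_i$ equals 1, so the softmax returns equal weights; a positive `lam` shifts weight toward longer responses and a negative `lam` toward shorter ones. How these weights enter the clipped token-level objective is not specified in this summary and is left out of the sketch.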

Abstract

Reinforcement Learning from Human Feedback (RLHF) has been the dominant approach for improving the reasoning capabilities of Large Language Models (LLMs). Recently, Reinforcement Learning with Verifiable Rewards (RLVR) has simplified this paradigm by replacing the reward and value models with rule-based verifiers. A prominent example is Group Relative Policy Optimization (GRPO). However, GRPO inherently suffers from a length bias, since the same advantage is uniformly assigned to all tokens of a response. As a result, longer responses distribute the reward over more tokens and thus contribute disproportionately to gradient updates. Several variants, such as DAPO and Dr. GRPO, modify the token-level aggregation of the loss, yet these methods remain heuristic and offer limited interpretability regarding their implicit token preferences. In this work, we explore the possibility of allowing the model to learn its own token preference during optimization. We unify existing frameworks under a single formulation and introduce a learnable parameter $\lambda$ that adaptively controls token-level weighting. We use $\lambda$-GRPO to denote our method, and we find that $\lambda$-GRPO achieves consistent improvements over vanilla GRPO and DAPO on multiple mathematical reasoning benchmarks. On Qwen2.5 models with 1.5B, 3B, and 7B parameters, $\lambda$-GRPO improves average accuracy by $+1.9\%$, $+1.0\%$, and $+1.7\%$ compared to GRPO, respectively. Importantly, these gains come without any modifications to the training data or additional computational cost, highlighting the effectiveness and practicality of learning token preferences.
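To make the aggregation differences concrete, the sketch below computes the weight each token and each response receives under the three schemes as they are commonly described: GRPO averages over tokens within a response and then over responses, DAPO averages over all tokens in the group, and Dr. GRPO divides by a fixed constant instead of the response length. These forms are taken from the usual descriptions of those methods rather than from this summary, and the function names and the `const` parameter (defaulting to the longest response in the group) are hypothetical.

```python
def per_token_weights(lengths, scheme, const=None):
    """Weight applied to each token of response i (one value per response)."""
    group_size = len(lengths)
    total_tokens = sum(lengths)
    if scheme == "grpo":
        # Mean over tokens in a response, then mean over responses.
        return [1.0 / (group_size * n) for n in lengths]
    if scheme == "dapo":
        # Mean over every token in the group.
        return [1.0 / total_tokens for _ in lengths]
    if scheme == "dr_grpo":
        # Fixed normalizer instead of the response length (assumed default).
        c = const if const is not None else max(lengths)
        return [1.0 / (group_size * c) for _ in lengths]
    raise ValueError(f"unknown scheme: {scheme}")


def per_response_weights(lengths, scheme, const=None):
    """Total weight a response receives: per-token weight times its length."""
    token_w = per_token_weights(lengths, scheme, const)
    return [w * n for w, n in zip(token_w, lengths)]


lengths = [100, 400]
for scheme in ("grpo", "dapo", "dr_grpo"):
    print(scheme, per_token_weights(lengths, scheme),
          per_response_weights(lengths, scheme))
```

For a short and a long response, GRPO gives both the same total weight but a smaller weight per token in the long one, whereas DAPO gives every token the same weight so the long response dominates the total. Each scheme therefore encodes a fixed, implicit length preference, which is the quantity $\lambda$-GRPO proposes to learn.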

Paper Structure

This paper contains 30 sections, 24 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: The key distinction among GRPO, DAPO, and $\lambda$-GRPO lies in their token-aggregation schemes (highlighted in the shaded region). Across Qwen2.5 models of three sizes (1.5B, 3B, and 7B) and 8 benchmarks, $\lambda$-GRPO, which uses a learnable parameter $\lambda$, consistently outperforms the other two methods.
  • Figure 2: Illustration of our learnable preference design. (a) shows how the scaling factor $1/k$ influences the distribution of normalized lengths $h$. Larger scaling values produce a wider spread, magnifying differences between long and short responses; smaller values compress the distribution around $1$, reducing sensitivity to length deviations. (b) shows how the exponent $\lambda$ adjusts the direction of token preference through $g_i=h_i^{\lambda}$. When $\lambda=0$, all responses are weighted equally; positive $\lambda$ favors longer responses, while negative $\lambda$ emphasizes shorter ones.
  • Figure 3: Training curves for the Qwen2.5-1.5B base model across benchmarks. Each panel shows accuracy (%) versus training step for three methods ($\lambda$-GRPO, DAPO, and GRPO). Steps are {0, 40, 80, 120, 160}.
  • Figure 4: Comparison of our method against GRPO and DAPO across two diagnostics: (a) entropy and (b) response length. Both experiments are based on Qwen2.5-3B.
  • Figure 5: Training curves for the Qwen2.5-3B base model across benchmarks. Each panel shows accuracy (%) versus training step for three methods ($\lambda$-GRPO, DAPO, and GRPO). Steps are {0, 40, 80, 120, 160}.