Tackling Length Inflation Without Trade-offs: Group Relative Reward Rescaling for Reinforcement Learning

Zichao Li, Jie Lou, Fangchen Dong, Zhiyuan Fan, Mengjie Ren, Hongyu Lin, Xianpei Han, Debing Zhang, Le Sun, Yaojie Lu, Xing Yu

TL;DR

Group Relative Reward Rescaling (GR$^3$) reframes length control as multiplicative rescaling, which yields a generalized, continuous, and reward-dependent gating mechanism. It maintains training dynamics and downstream performance comparable to standard GRPO while significantly mitigating length inflation.

Abstract

Reinforcement learning significantly enhances LLM capabilities but suffers from a critical issue: length inflation, where models adopt verbosity or inefficient reasoning to maximize rewards. Prior approaches struggle to address this challenge in a general and lossless manner, primarily because additive penalties introduce a compensatory effect that creates optimization shortcuts, while heuristic gating strategies lack generality beyond binary feedback. To bridge this gap, we present Group Relative Reward Rescaling (GR$^3$), which reframes length control as a multiplicative rescaling paradigm, effectively establishing a generalized, continuous, and reward-dependent gating mechanism. To further ensure lossless optimization, we incorporate group-relative regularization and advantage-aware calibration, which dynamically adapt length budgets to instance difficulty and preserve the advantage signal of high-quality trajectories. Empirically, across both RLHF and RLVR settings, GR$^3$ maintains training dynamics and downstream performance comparable to standard GRPO while significantly mitigating length inflation, outperforming state-of-the-art length-regularized baselines.

Paper Structure (45 sections, 2 theorems, 40 equations, 6 figures, 11 tables)

Key Result

Proposition 3.1

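Let $(R,S)$ have finite second moments and define $\mu_R=\mathbb{E}[R]$, $\mu_S=\mathbb{E}[S]$, $\sigma_R^2=\mathrm{Var}(R)$, $\sigma_S^2=\mathrm{Var}(S)$, and $\sigma_{RS}=\mathrm{Cov}(R,S)$. For $\hat{R}^{(+)} = R + \lambda S$ with $\lambda > 0$,

$$\hat{R}^{(+)} - \mathbb{E}[\hat{R}^{(+)}] = (R-\mu_R) + \lambda(S-\mu_S), \qquad \mathrm{Var}(\hat{R}^{(+)}) = \sigma_R^2 + 2\lambda\sigma_{RS} + \lambda^2\sigma_S^2,$$

and hence the normalized advantage equals

$$\frac{(R-\mu_R) + \lambda(S-\mu_S)}{\sqrt{\sigma_R^2 + 2\lambda\sigma_{RS} + \lambda^2\sigma_S^2}}.$$

Therefore, the length-related signal $(S-\mu_S)$ is linearly injected into the advantage with fixed weight $\lambda$, and can contribute eve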
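
The linear injection described in Proposition 3.1 can be checked numerically on a small group. The sketch below is illustrative only: the toy rewards `R`, the length scores `S`, the weight `lam`, and the helper `group_normalize` are assumptions made for this example, not values or code from the paper.

```python
import numpy as np

def group_normalize(x):
    """Center by the group mean and divide by the group standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# Toy group of six sampled responses (hypothetical values).
R = np.array([1.0, 1.0, 0.0, 0.0, 1.0, 0.0])  # task rewards
S = np.array([0.2, 0.9, 0.8, 0.1, 0.5, 0.6])  # assumed length scores, higher = shorter
lam = 0.5                                     # assumed additive weight

shaped = R + lam * S

# The centered shaped reward splits into a reward term plus lam times a length term.
centered = shaped - shaped.mean()
decomposed = (R - R.mean()) + lam * (S - S.mean())

# Variance of the shaped reward written in terms of the moments of (R, S).
cov_RS = np.mean((R - R.mean()) * (S - S.mean()))
var_formula = R.var() + 2 * lam * cov_RS + lam**2 * S.var()

adv = group_normalize(shaped)
adv_formula = decomposed / np.sqrt(var_formula)

print(np.allclose(adv, adv_formula))
# Responses 2 and 3 share the same task reward (0) but receive different advantages.
print(adv[2] > adv[3])
```

In this toy group, two responses with identical task reward are ranked purely by their length score, which is the fixed-weight contribution the proposition refers to.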

Figures (6)

  • Figure 1: Comparison of GR$^{3}$ with open-source efficient reasoning models, all trained on DeepSeek-R1-Distill-7B. GR$^3$ pioneers a new paradigm that sustains stable performance gains under RL while simultaneously mitigating the length inflation issue.
  • Figure 2: Training dynamics of GR$^{3}$, which retains GRPO’s reward gains without loss while significantly reducing average tokens. The base models used for the two settings are DeepSeek-R1-Distill-1.5B and Qwen3-8B (without thinking mode), respectively.
  • Figure 3: Additive length regularization consistently degrades task reward across different choices of $\lambda$, whereas the multiplicative scheme maintains stable reward improvement throughout training.
  • Figure 4: Sensitivity of $\alpha$: reward gap relative to the standard GRPO baseline versus $\alpha$ (log scale). Marker color indicates the average CSR measured during actual training, while the triangle marker denotes the value of $\alpha$ selected during the calibration phase.
  • Figure 5: Comparison of multiplicative and gated additive shaping during RLHF training. Multiplicative shaping matches standard GRPO in task reward while achieving controlled length reduction. In contrast, gated additive variants with different reward thresholds underperform and produce overly short responses, reflecting optimization instability introduced by hard gating.
  • ...and 1 more figure

Theorems & Definitions (5)

  • Proposition 3.1: Additive shaping: linear injection of the length signal
  • proof
  • Proposition 3.2: Multiplicative shaping: reward-weighted length signal
  • proof
  • Remark 3.3: Why multiplicative shaping is reward-aware under group normalization
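
The contrast named in the propositions and remark above can be illustrated on a toy group. This is a sketch under stated assumptions: the multiplicative form is taken to be `R * S` with a length score `S` in (0, 1], which may differ from the paper's exact definition, and all values, including `lam`, are hypothetical.

```python
import numpy as np

def group_normalize(x):
    """Center by the group mean and divide by the group standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

# Two correct and two incorrect responses, each pair with one long and one short answer.
R = np.array([1.0, 1.0, 0.0, 0.0])  # task rewards (hypothetical)
S = np.array([0.3, 0.9, 0.3, 0.9])  # assumed length score in (0, 1], higher = shorter
lam = 0.5                           # assumed additive weight

adv_add = group_normalize(R + lam * S)  # additive shaping
adv_mul = group_normalize(R * S)        # assumed multiplicative shaping

# Additive: the short incorrect response is preferred over the long incorrect one.
print(adv_add[3] > adv_add[2])
# Multiplicative: incorrect responses get the same advantage regardless of length,
# while length still separates the two correct responses.
print(np.isclose(adv_mul[2], adv_mul[3]), adv_mul[1] > adv_mul[0])
```

Under these assumptions, the length signal in the multiplicative case is scaled by the task reward, so it vanishes for zero-reward responses instead of offering them a way to gain advantage through brevity alone.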