VL Norm: Rethink Loss Aggregation in RLVR

Zhiyuan He, Xufang Luo, Yike Zhang, Yuqing Yang, Lili Qiu

TL;DR

This work proposes VL Norm (Variance-reduced Length-dependent Normalization), a simple yet effective loss aggregation method tailored to the dynamic generation lengths in Reinforcement Learning with Verifiable Rewards (RLVR). In theory, it not only provides an unbiased estimate of the true policy loss but also minimizes gradient variance.

Abstract

We propose VL Norm (Variance-reduced Length-dependent Normalization), a simple yet effective loss aggregation method tailored to the dynamic generation lengths in Reinforcement Learning with Verifiable Rewards (RLVR). Recently, RLVR has demonstrated strong potential in improving the reasoning capabilities of large language models (LLMs), but a major challenge lies in the large variability of response lengths during training, which leads to high gradient variance and unstable optimization. Although previous methods such as GRPO, DAPO, and Dr. GRPO introduce different loss normalization terms to address this issue, they either produce biased estimates or still suffer from high gradient variance. By analyzing the effect of varying lengths on policy loss both theoretically and empirically, we reformulate the problem as finding a minimum-variance unbiased estimator. Our proposed VL Norm not only provides an unbiased estimate of the true policy loss but also minimizes gradient variance in theory. Moreover, VL Norm is easy to implement, requiring fewer than 10 lines of code changes. Extensive experiments show that it consistently achieves superior results across different model sizes, maximum lengths, and tasks. When integrated into the state-of-the-art RL algorithm DAPO, it achieves up to 2.67x faster convergence on the CountDown task. Our code is public at https://github.com/zerolllin/Delta-L-Normalization.
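To make the "minimum-variance unbiased estimator" framing concrete, the following is a minimal sketch of textbook inverse-variance weighting, not the exact VL Norm formula (which this summary does not state). It assumes, purely for illustration, that per-sample gradient estimates are independent, share the same mean, and have variance proportional to response length. The function names and the example lengths are hypothetical.

```python
# Illustrative sketch only: textbook inverse-variance weighting under the
# ASSUMPTION that Var[g_i] = unit_var * L_i for a response of length L_i.
# This is not taken from the paper's code; all names are hypothetical.

def combined_variance(weights, lengths, unit_var=1.0):
    """Variance of sum_i w_i * g_i for independent g_i with Var[g_i] = unit_var * L_i."""
    return sum(w * w * unit_var * L for w, L in zip(weights, lengths))


def uniform_weights(lengths):
    """Equal weight per sample; weights sum to 1."""
    n = len(lengths)
    return [1.0 / n] * n


def inverse_length_weights(lengths):
    """Weights proportional to 1 / L_i, normalized to sum to 1.

    With weights summing to 1 and a shared mean, the combination stays
    unbiased; under the variance assumption above, this choice minimizes
    the combined variance.
    """
    inv = [1.0 / L for L in lengths]
    total = sum(inv)
    return [x / total for x in inv]


# Hypothetical response lengths with large variability.
lengths = [100, 400, 1600, 6400]

w_uniform = uniform_weights(lengths)
w_inverse = inverse_length_weights(lengths)

var_uniform = combined_variance(w_uniform, lengths)
var_inverse = combined_variance(w_inverse, lengths)

print(f"uniform weights:        variance = {var_uniform:.2f}")
print(f"inverse-length weights: variance = {var_inverse:.2f}")
```

Under these assumptions, the inverse-length weighting attains the variance 1 / sum(1 / L_i), which is never larger than the variance under uniform weighting. The gap widens as response lengths become more spread out, which is the regime the abstract describes.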

Paper Structure

This paper contains 32 sections, 25 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Left: In RLVR, trajectory lengths vary significantly, and long trajectories induce high gradient variance, causing unstable training. Right: Existing gradient aggregation methods across different lengths either lead to biased updates or suffer from high variance. In this paper, we propose a new aggregation method, VL Norm, that is both unbiased and variance-minimized.
  • Figure 2: We integrate the proposed VL Norm into DAPO and evaluate it on the CountDown task with a 3B model. Left: While DAPO reaches an accuracy of 0.866 at step 1000, DAPO+VL Norm achieves the same accuracy at step 375, demonstrating a 2.67× faster convergence. Right: When both are trained to step 1000, our method further delivers a 4.6% absolute accuracy gain.
  • Figure 3: Training dynamics of VL Norm compared with baselines across tasks (CountDown, Math), maximum lengths (3072, 8192), and model sizes (3B, 7B). Performance is measured by Avg@8 on CountDown and by a weighted Avg@8 across four math datasets. VL Norm consistently yields more stable training and converges to higher accuracy.
  • Figure 4: Deviation $||\bm{g}_i - \mathbb{E}[\bm{g}_i]||^2$ for a randomly selected sample on the Q, K, V projections in the last layer. $\mathbb{E}[\bm{g}_i]$ is estimated by the average of 128 rollouts.
  • Figure 5: Comparison between low and high gradient variance.
  • ...and 6 more figures