Length-Unbiased Sequence Policy Optimization: Revealing and Controlling Response Length Variation in RLVR

Fanfan Liu, Youyang Yin, Peng Shi, Siqi Yang, Zhixiong Zeng, Haibo Qiu

TL;DR

The paper analyzes length variation in RLVR objectives for large language and vision-language models, revealing a bias that can cause length collapse or undesired shortening in training. It proposes Length-Unbiased Sequence Policy Optimization (LUSPO), which multiplies each sequence loss by its length to neutralize length bias while preserving the sequence-level advantages of GSPO. Theoretical gradient analysis shows LUSPO eliminates length-dependent bias and maintains stable updates, and extensive experiments across dense and MoE models, as well as text-only and multimodal benchmarks, demonstrate superior accuracy and longer, more expressive responses compared to GRPO and GSPO. This yields a robust, scalable optimization approach for RLVR with meaningful implications for improving reasoning capabilities in large models.
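
The length weighting described above can be illustrated with a minimal sketch. It assumes a GSPO-style sequence-level objective, with a length-normalized importance ratio that is clipped once per sequence. The function names, the clipping range `eps=0.2`, and the averaging over the group are illustrative assumptions, not the paper's exact formulation. The only step taken from the summary is multiplying each sequence's term by its length.

```python
import math

def sequence_ratio(logp_new, logp_old):
    """Length-normalized sequence importance ratio (GSPO-style, assumed form).

    logp_new / logp_old: per-token log-probabilities of one response under
    the current and the old policy.
    """
    n = len(logp_new)
    return math.exp(sum(a - b for a, b in zip(logp_new, logp_old)) / n)

def clipped_term(ratio, advantage, eps=0.2):
    """PPO-style clipped surrogate applied once per sequence (eps is assumed)."""
    clipped = min(max(ratio, 1.0 - eps), 1.0 + eps)
    return min(ratio * advantage, clipped * advantage)

def group_loss(group, length_weighted, eps=0.2):
    """Loss over one group of sampled responses.

    group: list of (logp_new, logp_old, advantage) tuples.
    length_weighted=False gives the GSPO-style sequence loss;
    length_weighted=True multiplies each sequence term by its token count,
    as the summary describes for LUSPO.
    """
    total = 0.0
    for logp_new, logp_old, advantage in group:
        term = clipped_term(sequence_ratio(logp_new, logp_old), advantage, eps)
        if length_weighted:
            term *= len(logp_new)
        total += term
    return -total / len(group)

# Hypothetical group: a short response with positive advantage and a long
# response with negative advantage, both sampled on-policy (ratio = 1).
short = ([-0.5] * 2, [-0.5] * 2, 1.0)
long_ = ([-0.5] * 8, [-0.5] * 8, -1.0)
example_group = [short, long_]

gspo_style = group_loss(example_group, length_weighted=False)
luspo_style = group_loss(example_group, length_weighted=True)
```

In the unweighted form, every response contributes one term of equal weight regardless of how many tokens it contains. In the weighted form, a response's contribution grows with its token count. The sketch covers only the loss value; the paper's gradient analysis is not reproduced here.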

Abstract

Recent applications of Reinforcement Learning with Verifiable Rewards (RLVR) to Large Language Models (LLMs) and Vision-Language Models (VLMs) have demonstrated significant success in enhancing reasoning capabilities for complex tasks. During RLVR training, an increase in response length is often regarded as a key factor contributing to the growth of reasoning ability. However, the patterns of change in response length vary significantly across different RLVR algorithms during the training process. To provide a fundamental explanation for these variations, this paper conducts an in-depth analysis of the components of mainstream RLVR algorithms. We present a theoretical analysis of the factors influencing response length and validate our theory through extensive experimentation. Building upon these theoretical findings, we propose the Length-Unbiased Sequence Policy Optimization (LUSPO) algorithm. Specifically, we rectify the length bias inherent in Group Sequence Policy Optimization (GSPO), rendering its loss function unbiased with respect to response length and thereby resolving the issue of response length collapse. We conduct extensive experiments across mathematical reasoning benchmarks and multimodal reasoning scenarios, where LUSPO consistently achieves superior performance. Empirical results demonstrate that LUSPO represents a novel, state-of-the-art optimization strategy compared to existing methods such as GRPO and GSPO.

Paper Structure

This paper contains 19 sections, 11 equations, 7 figures, and 5 tables.

Figures (7)

  • Figure 1: Response length during RLVR training for Qwen2.5-VL-7B-Instruct. Under strictly controlled experimental settings (with all conditions except for the loss function kept constant), we compared the response length curves of GRPO and GSPO. It can be observed that GRPO induces the model to generate longer responses, while GSPO leads the model to gradually shorten its response length during training.
  • Figure 2: Response length during the training of Qwen2.5-VL-7B-Instruct with GSPO on different datasets exhibits different trends.
  • Figure 3: System prompt used during VL model training.
  • Figure 4: Response length training curves for GSPO and LUSPO.
  • Figure 5: Accuracy reward training curves for GSPO and LUSPO.
  • ...and 2 more figures