LSPO: Length-aware Dynamic Sampling for Policy Optimization in LLM Reasoning

Weizhe Chen, Sven Koenig, Bistra Dilkina

TL;DR

RLVR training for reasoning models typically relies on loss design and static data selection. This work proposes LSPO, a length-aware dynamic sampling method that computes $L(q)$, the average response length for each prompt $q$, and retains only prompts at the short and long extremes via thresholds $L_{low}$ and $L_{high}$ (together with $L_{max}$), which are recalculated for each batch. When paired with base RLVR algorithms such as GRPO, DAPO, or GSPO, LSPO consistently improves final test accuracy across multiple math benchmarks and base models, although dynamic sampling increases rollout time. Ablation studies show that training on extreme lengths generalizes better than training on intermediate lengths, and that length-based filtering outperforms fixed or accuracy-based alternatives on average. Overall, LSPO shows that incorporating response-length signals into dynamic sampling can improve reasoning performance, and it points future work toward adaptive thresholds and complementary data-selection criteria.

Abstract

Since the release of DeepSeek-R1, reinforcement learning with verifiable rewards (RLVR) has become a central approach for training large language models (LLMs) on reasoning tasks. Recent work has largely focused on modifying loss functions to make RLVR more efficient and effective. In this paper, motivated by studies of overthinking in LLMs, we propose Length-aware Sampling for Policy Optimization (LSPO), a novel meta-RLVR algorithm that dynamically selects training data at each step based on the average response length. We evaluate LSPO across multiple base models and datasets, demonstrating that it consistently improves learning effectiveness. In addition, we conduct a detailed ablation study to examine alternative ways of incorporating length signals into dynamic sampling, offering further insights and highlighting promising directions for future research.

Paper Structure

This paper contains 37 sections, 8 equations, 3 figures, 2 tables, 1 algorithm.

Figures (3)

  • Figure 1: Illustration of Length-aware Sampling for Policy Optimization (LSPO). LSPO builds on common accuracy-based filtering and further filters responses by length, retaining prompts whose average response length is either the longest or the shortest. The vertical height of each response block represents its length.
  • Figure 2: LSPO with GSPO as the base algorithm, compared with GSPO alone, trained on Qwen-2.5-Math-7B with the DAPO-17K dataset and the training set of the MATH dataset.
  • Figure 3: LSPO with DAPO as the base algorithm, compared with DAPO alone, trained on Llama-3.2-4B-Instruct with the training set of the MATH dataset.
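
The filtering that Figure 1 illustrates can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the batch layout, and the quantile levels `low_frac` and `high_frac` are hypothetical, the accuracy-based step is assumed to be the common rule of dropping prompts whose responses are all correct or all wrong, and $L_{max}$ is omitted because its exact role is not described here.

```python
from statistics import mean


def quantile(sorted_values, fraction):
    """Nearest-rank style quantile on an already sorted, non-empty list."""
    index = min(len(sorted_values) - 1, int(fraction * len(sorted_values)))
    return sorted_values[index]


def length_aware_filter(batch, low_frac=0.3, high_frac=0.7):
    """Return the prompt ids whose average response length L(q) is extreme.

    batch: dict mapping prompt id -> list of (response_length, is_correct).
    low_frac / high_frac: hypothetical quantile levels (an assumption of
    this sketch, not values taken from the paper).
    """
    # Step 1 (assumed accuracy-based filtering): drop prompts whose
    # responses are all correct or all wrong.
    mixed = {
        q: responses
        for q, responses in batch.items()
        if 0 < sum(ok for _, ok in responses) < len(responses)
    }
    if not mixed:
        return []

    # Step 2: average response length per prompt, L(q).
    avg_len = {
        q: mean(length for length, _ in responses)
        for q, responses in mixed.items()
    }

    # Step 3: thresholds are recomputed from the current batch only.
    ordered = sorted(avg_len.values())
    l_low = quantile(ordered, low_frac)
    l_high = quantile(ordered, high_frac)

    # Step 4: keep the short and long extremes, drop the middle.
    return [q for q, value in avg_len.items() if value <= l_low or value >= l_high]


example_batch = {
    "a": [(10, True), (20, False)],
    "b": [(100, True), (120, False)],
    "c": [(500, True), (700, False)],
    "d": [(50, True), (60, True)],      # all correct, removed in step 1
    "e": [(1000, False), (3000, True)],
    "f": [(200, True), (400, False)],   # intermediate length, removed in step 4
}
kept = length_aware_filter(example_batch)
```

Because the thresholds come from the current batch rather than from fixed constants, the notion of a "short" or "long" prompt shifts as response lengths drift during training, which matches the per-batch recalculation described in the TL;DR.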