LSPO: Length-aware Dynamic Sampling for Policy Optimization in LLM Reasoning
Weizhe Chen, Sven Koenig, Bistra Dilkina
TL;DR
RLVR training for reasoning models often relies on loss design and static data selection. This work proposes LSPO, a length-aware dynamic sampling method that uses $L(q)$, the average response length for prompt $q$, to retain only training samples whose responses are extremely short or extremely long, as determined by thresholds $L_{low}$ and $L_{high}$ (and $L_{max}$) that are recalculated for each batch. When paired with base RLVR algorithms such as GRPO, DAPO, or GSPO, LSPO consistently improves final test accuracy across multiple math benchmarks and base models, although dynamic sampling increases rollout time. Ablation studies show that training on extreme lengths generalizes better than training on intermediate lengths, and that length-based filtering outperforms fixed or accuracy-based alternatives on average. Overall, LSPO demonstrates that incorporating response-length signals into dynamic sampling can boost reasoning performance, and it points future work toward adaptive thresholds and complementary criteria for data selection.
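As a rough illustration of the length-based filtering step summarized above, the following Python sketch keeps only prompts whose average response length $L(q)$ falls in the short or long extreme of the current batch. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the quantile-based way of deriving $L_{low}$, $L_{high}$, and $L_{max}$, the specific quantile levels, and the reading of $L_{max}$ as an upper cap on the long band are all hypothetical choices made for illustration.

```python
from statistics import mean


def quantile(sorted_vals, p):
    """Simple floor-rank quantile of a sorted, non-empty list (0 <= p <= 1)."""
    idx = int(p * (len(sorted_vals) - 1))
    return sorted_vals[idx]


def length_aware_filter(batch, low_q, high_q, max_q):
    """Return the prompts whose average response length L(q) is extreme.

    batch  : dict mapping a prompt id to a list of response lengths (tokens).
    low_q, high_q, max_q : hypothetical quantile levels used to derive the
        thresholds L_low, L_high, and L_max from the current batch
        (an assumption; the source only says they are recalculated per batch).
    """
    # L(q): average response length for each prompt q.
    avg_len = {q: mean(lengths) for q, lengths in batch.items()}

    # Thresholds are recomputed from the lengths observed in this batch.
    ordered = sorted(avg_len.values())
    l_low = quantile(ordered, low_q)
    l_high = quantile(ordered, high_q)
    l_max = quantile(ordered, max_q)

    # Keep the short extreme (L(q) <= L_low) and the long extreme
    # (L_high <= L(q) <= L_max); drop intermediate lengths.
    kept = [q for q, l in avg_len.items() if l <= l_low or l_high <= l <= l_max]
    return kept, (l_low, l_high, l_max)


# Toy batch: five prompts with two sampled responses each.
batch = {
    "a": [100, 120],
    "b": [200, 220],
    "c": [300, 320],
    "d": [400, 420],
    "e": [500, 520],
}
kept, thresholds = length_aware_filter(batch, low_q=0.25, high_q=0.75, max_q=1.0)
print(kept)        # ['a', 'b', 'd', 'e']  (prompt 'c' has an intermediate length)
print(thresholds)  # (210, 410, 510)
```

In a dynamic sampling loop, a filter like this would presumably be applied repeatedly to freshly generated rollouts until enough prompts are retained to fill a training batch, which is consistent with the increased rollout time noted above.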
Abstract
Since the release of DeepSeek-R1, reinforcement learning with verifiable rewards (RLVR) has become a central approach for training large language models (LLMs) on reasoning tasks. Recent work has largely focused on modifying loss functions to make RLVR more efficient and effective. In this paper, motivated by studies of overthinking in LLMs, we propose Length-aware Sampling for Policy Optimization (LSPO), a novel meta-RLVR algorithm that dynamically selects training data at each step based on the average response length. We evaluate LSPO across multiple base models and datasets, demonstrating that it consistently improves learning effectiveness. In addition, we conduct a detailed ablation study to examine alternative ways of incorporating length signals into dynamic sampling, offering further insights and highlighting promising directions for future research.
