On the Optimal Reasoning Length for RL-Trained Language Models

Daisuke Nohara, Taishi Nakamura, Rio Yokota

TL;DR

The paper asks how to balance reasoning performance against computational cost in RL-trained language models. It evaluates several length-control strategies on two models with different prior reasoning ability and finds that the relationship between output length and performance is model-dependent: scores improve monotonically with length for Qwen3-1.7B-Base, whereas DeepSeek-R1-Distill-Qwen-1.5B peaks at intermediate lengths. The authors identify two failure modes (dispersion with long outputs and under-thinking with short outputs) and show that length penalties can hinder learning, especially for models lacking strong prior reasoning, while properly tuned length control can improve efficiency for models that already reason well. These findings point to the need for length-control methods adapted to a model's existing reasoning capability.

Abstract

Reinforcement learning substantially improves reasoning in large language models, but it also tends to lengthen chain-of-thought outputs and increase computational cost during both training and inference. Although length-control methods have been proposed, it remains unclear what output length best balances efficiency and performance. In this work, we compare several length-control methods on two models, Qwen3-1.7B-Base and DeepSeek-R1-Distill-Qwen-1.5B. Our results indicate that length penalties may hinder reasoning acquisition, while properly tuned length control can improve efficiency for models with strong prior reasoning. By extending prior work to RL-trained policies, we identify two failure modes: (1) long outputs increase dispersion, and (2) short outputs lead to under-thinking.

Paper Structure

This paper contains 31 sections, 10 equations, 10 figures, and 1 table.

Figures (10)

  • Figure 1: Score vs. average output length. Qwen3-1.7B-Base (top) shows a monotonically increasing trend, while DeepSeek-R1-Distill-Qwen-1.5B (bottom) exhibits a non-monotonic relationship with optimal performance at intermediate lengths.
  • Figure 2: Decomposition of accuracy into mode accuracy and dispersion metrics for DeepSeek-R1-Distill-Qwen-1.5B on AMC and MATH-500. In the long-output regime, degradation is driven by increased dispersion; in the short-output regime, both central tendency and dispersion are affected.
  • Figure 3: Effect of batch size configuration on training dynamics. Top left: response length during training. Top right: absolute difference between rollout and training token probabilities. Bottom: validation scores on MATH-500 and AIME 2024. The 512/32 setting (generation batch size 512, mini-batch size 32) leads to decreasing response length and validation performance, while the 64/64 setting maintains stable training.
  • Figure 4: Effect of precision and TIS on training dynamics for DeepSeek-R1-Distill-Qwen-1.5B. Comparison of BF16 with TIS, FP16 with TIS, FP16 without TIS, and ALP ($\beta = 10^{-4}$) in FP16 without TIS. Note that the maximum response length differs between the ALP experiment and the precision ablation experiments, so response lengths should not be directly compared across these settings.
  • Figure 5: GFPO reproduction attempt on two base models. Average output length during training for GFPO ($G=16$, $k=8$) compared to the DAPO baseline. Both Qwen3-1.7B-Base and DeepSeek-R1-Distill-Qwen-1.5B show increasing output length under GFPO, particularly in later training stages.
  • ...and 5 more figures
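
The decomposition described in the Figure 2 caption can be illustrated with a small sketch. This excerpt does not give the paper's exact definitions, so the code below rests on assumptions: "mode accuracy" is taken to mean whether the most frequent sampled answer matches the reference, and "dispersion" is measured in two common ways, as one minus the share of samples that agree with the mode and as the entropy of the sampled-answer distribution. The function name `decompose_samples` and the dictionary keys are hypothetical, chosen only for illustration.

```python
from collections import Counter
import math


def decompose_samples(answers, reference):
    """Summarize repeated samples for one problem.

    Assumed definitions (not taken from the paper):
      - accuracy:      fraction of samples equal to the reference answer
      - mode_correct:  whether the most frequent answer equals the reference
      - spread:        1 - share of samples that agree with the mode
      - entropy:       Shannon entropy (nats) of the answer distribution
    """
    if not answers:
        raise ValueError("answers must be non-empty")

    counts = Counter(answers)
    n = len(answers)

    # most_common breaks ties by first occurrence in `answers`.
    mode_answer, mode_count = counts.most_common(1)[0]

    entropy = -sum((c / n) * math.log(c / n) for c in counts.values())

    return {
        "accuracy": counts.get(reference, 0) / n,
        "mode_correct": mode_answer == reference,
        "spread": 1.0 - mode_count / n,
        "entropy": entropy,
    }


# Two hypothetical sets of 8 sampled answers for a problem whose answer is "4".
concentrated = ["4"] * 8
dispersed = ["4", "4", "4", "5", "6", "4", "7", "4"]

summary_concentrated = decompose_samples(concentrated, "4")
summary_dispersed = decompose_samples(dispersed, "4")
```

In this toy example the mode is correct in both sets, yet per-sample accuracy is lower in the second set because the samples are spread over more distinct answers. Under the assumed definitions, that is the pattern the caption attributes to the long-output regime: the central tendency holds while dispersion grows.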