On the Optimal Reasoning Length for RL-Trained Language Models
Daisuke Nohara, Taishi Nakamura, Rio Yokota
TL;DR
The paper asks how to balance reasoning performance against computational cost in RL-trained language models, systematically evaluating length-control strategies on two models that differ in prior reasoning ability. The relationship between output length and performance turns out to be model-dependent: performance improves monotonically with length for Qwen3-1.7B-Base, whereas for DeepSeek-R1-Distill-Qwen-1.5B it is non-monotonic and peaks at intermediate lengths. The authors identify two failure modes (dispersion with long outputs and under-thinking with short outputs) and show that length penalties can hinder learning unless carefully tuned, especially for models lacking strong prior reasoning. These findings highlight the need for adaptive length-control methods that align with a model’s inherent reasoning capabilities to achieve efficient, high-quality RL-trained reasoning.
Abstract
Reinforcement learning substantially improves reasoning in large language models, but it also tends to lengthen chain-of-thought outputs and increase computational cost during both training and inference. Although length-control methods have been proposed, it remains unclear what output length best balances efficiency and performance. In this work, we compare several length-control methods on two models, Qwen3-1.7B-Base and DeepSeek-R1-Distill-Qwen-1.5B. Our results indicate that length penalties may hinder reasoning acquisition, while properly tuned length control can improve efficiency for models with strong prior reasoning. Extending prior work to RL-trained policies, we identify two failure modes: 1) long outputs increase dispersion, and 2) short outputs lead to under-thinking.
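
To make the idea of a length penalty concrete, the following is a minimal sketch of one generic form of reward shaping: a correctness reward minus a linear penalty on output length. It is not the formulation used in the paper. The function name, the linear form, the token budget, and the values of alpha are all assumptions chosen for illustration.

```python
def length_penalized_reward(correct: bool, n_tokens: int,
                            max_tokens: int, alpha: float) -> float:
    """Hypothetical shaped reward: 1.0 for a correct answer, 0.0 otherwise,
    minus a linear penalty proportional to the fraction of the token budget used.

    All parameter names and the linear penalty are illustrative assumptions,
    not the paper's method.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    # Cap the fraction at 1.0 so outputs beyond the budget are not penalized further.
    length_fraction = min(n_tokens / max_tokens, 1.0)
    return float(correct) - alpha * length_fraction


if __name__ == "__main__":
    budget = 4000  # assumed token budget
    for alpha in (0.2, 1.5):  # assumed penalty strengths
        long_correct = length_penalized_reward(True, 4000, budget, alpha)
        short_wrong = length_penalized_reward(False, 200, budget, alpha)
        print(f"alpha={alpha}: long correct={long_correct:.3f}, "
              f"short wrong={short_wrong:.3f}")
```

In this toy setting, a mild penalty still ranks the long correct answer above the short wrong one, while a penalty strength above 1 reverses that ordering. This is one plausible way an untuned penalty could reward short outputs over correct ones, in the spirit of the under-thinking failure mode described above.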
