Just Enough Thinking: Efficient Reasoning with Adaptive Length Penalties Reinforcement Learning
Violet Xiang, Chase Blagden, Rafael Rafailov, Nathan Lile, Sang Truong, Chelsea Finn, Nick Haber
TL;DR
Adaptive Length Penalty (ALP) introduces a difficulty-conditioned RL objective that uses online solve rates to modulate generation length per prompt, reducing tokens on easy tasks while preserving reasoning on hard ones. By integrating a differentiable penalty that scales with the online solve rate $p_{\text{solved}}(q)$ (that is, inversely with difficulty), ALP allocates inference-time compute adaptively without extra cost to standard RL loops. Empirical results on DeepScaleR-1.5B show about a 50% reduction in token usage with minimal accuracy loss, plus strong Pareto-efficiency gains and robust performance across varying difficulty distributions. The approach yields interpretable shifts in reasoning behavior, minimizing redundancy and backtracking while maintaining structured problem-solving, with potential for broad applicability in real-world inference scenarios.
Abstract
Large reasoning models (LRMs) achieve higher performance on challenging reasoning tasks by generating more tokens at inference time, but this verbosity often wastes computation on easy problems. Existing solutions, including supervised finetuning on shorter traces, user-controlled budgets, and RL with uniform penalties, either require data curation or manual configuration, or treat all problems alike regardless of difficulty. We introduce Adaptive Length Penalty (ALP), a reinforcement learning objective that tailors generation length to the per-prompt solve rate. During training, ALP monitors each prompt's online solve rate through multiple rollouts and adds a differentiable penalty whose magnitude scales with that rate (that is, inversely with difficulty), so confident (easy) prompts incur a high cost for extra tokens while hard prompts remain unhindered. Post-training DeepScaleR-1.5B with ALP cuts average token usage by 50% without significantly dropping performance. Relative to fixed-budget and uniform-penalty baselines, ALP redistributes its reduced budget more intelligently by cutting compute on easy prompts and reallocating the saved tokens to difficult ones, delivering higher accuracy on the hardest problems at a higher token cost on those problems.
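To make the mechanism concrete, the following is a minimal sketch of how a solve-rate-weighted length penalty could be computed for one prompt. It is an illustration under stated assumptions, not the authors' implementation: the function name `alp_rewards`, the coefficient `beta`, and the linear form (correctness minus `beta` times solve rate times token count) are all hypothetical choices, since the abstract does not give the exact formula.

```python
from typing import List, Sequence


def alp_rewards(
    correct: Sequence[bool],
    lengths: Sequence[int],
    beta: float = 1e-4,  # hypothetical penalty coefficient, not from the paper
) -> List[float]:
    """Return one reward per rollout for a single prompt.

    Assumed form: reward = 1[correct] - beta * solve_rate * num_tokens,
    where solve_rate is the fraction of this prompt's rollouts that were correct.
    """
    if len(correct) != len(lengths) or len(correct) == 0:
        raise ValueError("correct and lengths must be non-empty and equally long")

    # Online solve rate estimated from the K rollouts sampled for this prompt.
    solve_rate = sum(1 for c in correct if c) / len(correct)

    # Easy prompts (high solve rate) pay more per token; hard prompts pay less.
    return [float(c) - beta * solve_rate * n for c, n in zip(correct, lengths)]


# Same rollout lengths, different difficulty.
lengths = [1000, 1000, 2000, 2000]
easy = alp_rewards([True, True, True, True], lengths)      # solve rate 1.0
hard = alp_rewards([False, False, False, True], lengths)   # solve rate 0.25
unsolved = alp_rewards([False, False, False, False], lengths)  # solve rate 0.0
```

In this sketch a 2000-token correct rollout is penalized four times more on the easy prompt than on the hard one, and a prompt with no correct rollouts incurs no length penalty at all. Because the solve rate is computed from rollouts the RL loop already samples, no additional generation is needed.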
