Just Enough Thinking: Efficient Reasoning with Adaptive Length Penalties Reinforcement Learning

Violet Xiang, Chase Blagden, Rafael Rafailov, Nathan Lile, Sang Truong, Chelsea Finn, Nick Haber

TL;DR

Adaptive Length Penalty (ALP) is a difficulty-conditioned RL objective that uses online solve rates to modulate generation length per prompt, reducing tokens on easy tasks while preserving reasoning on hard ones. By adding a differentiable length penalty scaled by the solve rate $p_{\text{solved}}(q)$ (so it shrinks as difficulty grows), ALP allocates inference-time compute adaptively without extra cost to standard RL loops. Empirical results on DeepScaleR-1.5B show about a 50% reduction in token usage with minimal accuracy loss, plus Pareto-efficiency gains and robust performance across varying difficulty distributions. The approach also yields interpretable shifts in reasoning behavior, reducing redundancy and backtracking while maintaining structured problem-solving, and may apply broadly to real-world inference settings.

Abstract

Large reasoning models (LRMs) achieve higher performance on challenging reasoning tasks by generating more tokens at inference time, but this verbosity often wastes computation on easy problems. Existing solutions, including supervised finetuning on shorter traces, user-controlled budgets, and RL with uniform penalties, require data curation or manual configuration, or treat all problems alike regardless of difficulty. We introduce Adaptive Length Penalty (ALP), a reinforcement learning objective that tailors generation length to the per-prompt solve rate. During training, ALP monitors each prompt's online solve rate through multiple rollouts and adds a differentiable penalty whose magnitude scales with that rate (that is, inversely with difficulty), so confident (easy) prompts incur a high cost for extra tokens while hard prompts remain unhindered. Post-training DeepScaleR-1.5B with ALP cuts average token usage by 50% without a significant drop in performance. Relative to fixed-budget and uniform-penalty baselines, ALP redistributes its reduced budget more intelligently, cutting compute on easy prompts and reallocating the saved tokens to difficult ones, which delivers higher accuracy on the hardest problems, where it spends more tokens.
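To make the mechanism in the abstract concrete, here is a minimal sketch of a per-rollout reward in which extra tokens are costly on easy prompts and nearly free on hard ones. The functional form is an assumption, not taken from the text above: the penalty is the product of a coefficient `beta`, the rollout length normalized by `max_len`, and the prompt's empirical solve rate, with a floor of `1/K` so that unsolved prompts still carry a small penalty. The function name, the parameter names, and the default `beta` are hypothetical.

```python
from typing import List


def alp_rewards(correct: List[bool], lengths: List[int],
                max_len: int, beta: float = 0.5) -> List[float]:
    """Sketch of a solve-rate-weighted length penalty for ONE prompt.

    correct[i] / lengths[i] describe the i-th of K rollouts sampled for
    the same prompt. The exact penalty form here is an assumption.
    """
    k = len(correct)
    if k == 0 or len(lengths) != k or max_len <= 0:
        raise ValueError("need K >= 1 rollouts, matching lengths, max_len > 0")

    # Online solve rate: fraction of the K rollouts that were correct.
    p_solved = sum(correct) / k

    # Assumed floor of 1/K keeps a small penalty on never-solved prompts.
    weight = max(p_solved, 1.0 / k)

    # Correctness reward minus a length penalty that grows with solve rate.
    return [float(c) - beta * (n / max_len) * weight
            for c, n in zip(correct, lengths)]
```

Under these assumptions, with `beta=0.5` and `max_len=400`, a correct 400-token rollout earns 0.5 on a prompt solved in 4 of 4 rollouts but 0.875 on a prompt solved in 1 of 4, so the same length costs four times more on the easy prompt.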

Paper Structure

This paper contains 23 sections, 2 equations, 6 figures, 2 tables, 1 algorithm.

Figures (6)

  • Figure 1: Pass@1 Performance with different inference budgets (512, 1024, 2048, 4096). Inference budget is enforced by setting the max number of generation tokens.
  • Figure 2: Pareto efficiency analysis reveals how models distribute computational resources across problems of varying difficulty (inference budget 4096). (Left) Cumulative token allocation curves for problems ordered from easiest to hardest, aggregated across MATH-500, OlympiadBench, and AIME datasets. Shaded regions indicate easy (0-50%) and hard (80-100%) problem ranges. (Right) Adaptation ratio is computed as tokens used for hard problems / tokens used for easy problems.
  • Figure 3: Performance-efficiency trade-offs under varying problem distributions (inference budget 4096). Each curve shows model behavior as MATH/AIME mixture changes. (Left) N=500 with 0-12% AIME content (typical deployment). (Right) N=100 with 0-60% AIME content (stress test). ALP maintains strong performance across all distributions through adaptive token allocation.
  • Figure 4: Token allocation reveals how models internally perceive problem difficulty (inference budget 4096). Average tokens used versus difficulty (1 - solve rate) across three datasets.
  • Figure 5: Token usage by MATH-500 difficulty levels.
  • ...and 1 more figure
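The adaptation ratio described in the Figure 2 caption can be sketched as follows. This is an illustration under stated assumptions, not the paper's evaluation code: problems are ranked by solve rate, "easy" is the easiest 50% and "hard" is the hardest 20% (matching the shaded ranges in the caption), and "tokens used" is taken to be the mean per problem in each group. The function and parameter names are hypothetical.

```python
from statistics import mean
from typing import Sequence


def adaptation_ratio(solve_rates: Sequence[float], tokens: Sequence[float],
                     easy_frac: float = 0.5, hard_frac: float = 0.2) -> float:
    """Mean tokens on the hardest problems divided by mean tokens on the easiest."""
    if len(solve_rates) != len(tokens) or not tokens:
        raise ValueError("solve_rates and tokens must be non-empty and aligned")

    # Order problem indices from easiest (highest solve rate) to hardest.
    order = sorted(range(len(tokens)), key=lambda i: -solve_rates[i])
    n = len(order)
    n_easy = max(1, int(n * easy_frac))
    n_hard = max(1, int(n * hard_frac))

    easy = [tokens[i] for i in order[:n_easy]]       # easiest 0-50%
    hard = [tokens[i] for i in order[n - n_hard:]]   # hardest 80-100%
    return mean(hard) / mean(easy)
```

A ratio near 1 would mean the model spends about the same on easy and hard problems; a larger ratio indicates that more of the budget is shifted toward hard ones.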