e1: Learning Adaptive Control of Reasoning Effort

Michael Kleinman, Matthew Trager, Alessandro Achille, Wei Xia, Stefano Soatto

TL;DR

This work tackles the problem of trading off reasoning effort against latency and cost by introducing Adaptive Effort Control (AEC), a reinforcement-learning-based method that trains models to answer with a user-specified fraction $r$ of the current average chain-of-thought length. By prompting the model with $x_r$ and using a reward $R_{AEC}$ that is granted only when the relative reasoning time $\frac{\ell(h)}{T_{avg}(x_r)}$ stays below $r$, AEC learns to allocate computation adaptively without task-specific tuning. Across model scales from 1.5B to 32B and on math datasets such as AIME, AMC, and MATH500, AEC achieves a 2–3× reduction in chain-of-thought length while maintaining or improving accuracy, and enables linear, interpretable control over relative accuracy or token usage via post-training calibration. The approach generalizes to out-of-domain tasks, supports inference-time adjustment of effort, and offers a simple RL-based modification to existing training pipelines with potential extensions to multi-agent settings.
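The thresholded reward described above can be illustrated with a minimal Python sketch. It is not the paper's implementation: the function names are hypothetical, the reward is assumed to be binary and to also require a correct answer, $T_{avg}(x_r)$ is assumed to be estimated as the mean chain-of-thought length over a group of rollouts sampled for the same prompt, and "stays below $r$" is read as less than or equal to $r$.

```python
from statistics import mean
from typing import List


def aec_reward(is_correct: bool, cot_length: int, avg_length: float, r: float) -> float:
    """Binary reward: 1.0 only if the answer is correct AND the relative
    chain-of-thought length l(h) / T_avg does not exceed the effort level r.
    (The correctness requirement and the binary form are assumptions.)"""
    if avg_length <= 0:
        raise ValueError("avg_length must be positive")
    relative_length = cot_length / avg_length
    return 1.0 if (is_correct and relative_length <= r) else 0.0


def rewards_for_group(correct: List[bool], lengths: List[int], r: float) -> List[float]:
    """Score a group of rollouts for one prompt, using the group's mean
    length as a stand-in for T_avg(x_r) (an assumption of this sketch)."""
    avg_length = mean(lengths)
    return [aec_reward(c, n, avg_length, r) for c, n in zip(correct, lengths)]


# Four rollouts with mean length 200 tokens and effort level r = 0.5:
# only the correct rollout at or under 100 tokens is rewarded.
example_rewards = rewards_for_group(
    correct=[True, True, False, True],
    lengths=[100, 200, 50, 450],
    r=0.5,
)
```

Because the threshold is expressed relative to the running average rather than as an absolute token count, the same value of $r$ yields a small budget on easy prompts and a larger one on hard prompts, which is consistent with the adaptive behavior the summary describes.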

Abstract

Increasing the thinking budget of AI models can significantly improve accuracy, but not all questions warrant the same amount of reasoning. Users may prefer to allocate different amounts of reasoning effort depending on how they value output quality versus latency and cost. To leverage this tradeoff effectively, users need fine-grained control over the amount of thinking used for a particular query, but few approaches enable such control. Existing methods require users to specify the absolute number of desired tokens, but this requires knowing the difficulty of the problem beforehand to appropriately set the token budget for a query. To address these issues, we propose Adaptive Effort Control, a self-adaptive reinforcement learning method that trains models to use a user-specified fraction of tokens relative to the current average chain-of-thought length for each query. This approach eliminates dataset- and phase-specific tuning while producing better cost-accuracy tradeoff curves compared to standard methods. Users can dynamically adjust the cost-accuracy trade-off through a continuous effort parameter specified at inference time. We observe that the model automatically learns to allocate resources proportionally to the task difficulty and, across model scales ranging from 1.5B to 32B parameters, our approach enables a 2-3x reduction in chain-of-thought length while maintaining or improving performance relative to the base model used for RL training.

Paper Structure

This paper contains 27 sections, 14 equations, 19 figures.

Figures (19)

  • Figure 1: Variable Effort Control and Improved Reasoning Efficiency. (Left) After training, increasing $r$ in the prompt leads to increasing accuracy and number of generated tokens across datasets of varying difficulty. Note that in addition to allowing control, this training approach also significantly reduces the number of generated tokens compared to the baseline (crosses) and to regular RL training (squares). (Right) We apply our approach across model scales ranging from 1.5B to 32B and show results averaged across math datasets (see Appendix Fig. \ref{fig:app-model-sizes} for dataset-specific results), with our approach enabling more efficient reasoning and improved performance relative to the base models. We observe better curves of performance versus number of generated tokens with increasing model size.
  • Figure 2: (Left) Comparison against other length-control baselines. Our controllable model (e1) obtains a superior accuracy-token tradeoff compared to L1-Exact, and is even slightly better than L1-Max (which does not allow precise control over the generation length) when the effort level is above $60\%$. See Appendix Fig. \ref{fig:app-l1-comparison} for performance on individual datasets. (Right) Adaptation to query difficulty. In addition to providing control over the number of tokens, e1 automatically adapts the length of the reasoning chain to the difficulty of the input task. We see this by ordering the problems from easiest to hardest (based on the fraction of times the model solves a problem correctly) and plotting the cumulative percentage of total tokens used for solving those problems. A diagonal line indicates equal token allocation to each problem, and increasingly convex curves indicate fewer tokens allocated to easier problems. Our approach (shown for $r=1$, purple) allocates a smaller percentage of tokens to easier problems than other length-control baselines and the base model used for RL training. In particular, our model allocates only $\sim$20% of tokens to the easiest 50% of problems.
  • Figure 3: Transferability of effort control skill across different domains. Despite training only on math questions, we find that after training with our objective we can modulate the number of generated tokens and the performance on different domains by varying the effort parameter $r$. Across all of the datasets, the number of generated tokens and the accuracy increase with the effort parameter supplied in the prompt.
  • Figure 4: Calibration for interpretable control over percentage effort. After training, we can reparameterize the effort parameter $r$ to linearly control either the (left) relative accuracy or (right) relative number of tokens and enable more intuitive user control.
  • Figure 5: Learning dynamics of the AEC training objective. We plot the accuracy and number of tokens as a function of training step for the three math evaluation datasets. Early in training, the model has similar accuracy and number of generated tokens regardless of the effort level, which means that the model has to learn to become sensitive to the effort level. We find that the number of generated tokens decreases roughly as a power law with exponent $\beta$ across training steps, and that shorter problems (MATH500) decrease faster (larger $\beta$) than longer problems (AIME). Despite token lengths decreasing through training, accuracy increases (provided that the effort level is above $70\%$).
  • ...and 14 more figures
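
The difficulty-adaptation curve described for Figure 2 (right) can be computed along the following lines. This is an illustrative sketch rather than the authors' evaluation code: the function name and the toy numbers are hypothetical, and per-problem solve rates and token counts are assumed to be available as plain lists.

```python
from typing import List


def cumulative_token_share(solve_rates: List[float], token_counts: List[int]) -> List[float]:
    """Order problems from easiest (highest solve rate) to hardest and return
    the cumulative fraction of all tokens spent up to each problem."""
    if len(solve_rates) != len(token_counts):
        raise ValueError("solve_rates and token_counts must have the same length")
    total = sum(token_counts)
    if total <= 0:
        raise ValueError("total token count must be positive")
    # Easiest first: sort indices by descending solve rate.
    order = sorted(range(len(solve_rates)), key=lambda i: -solve_rates[i])
    shares = []
    running = 0
    for i in order:
        running += token_counts[i]
        shares.append(running / total)
    return shares


# Toy data (made up for illustration): four problems, the hardest one
# consuming most of the tokens.
toy_curve = cumulative_token_share(
    solve_rates=[0.9, 0.1, 0.5, 1.0],
    token_counts=[100, 600, 200, 100],
)
# The easiest half of the problems (first two entries) uses 20% of the tokens.
```

A curve that stays well below the diagonal, as in this toy case, corresponds to the convex shape the caption associates with spending fewer tokens on easier problems.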