e1: Learning Adaptive Control of Reasoning Effort
Michael Kleinman, Matthew Trager, Alessandro Achille, Wei Xia, Stefano Soatto
TL;DR
This work tackles the problem of trading off reasoning effort against latency and cost by introducing Adaptive Effort Control (AEC), a reinforcement-learning-based method that trains models to answer with a user-specified fraction $r$ of the current average chain-of-thought length. By prompting the model with $x_r$ and using a reward $R_{AEC}$ that grants reward only when the relative reasoning length $\frac{\ell(h)}{T_{avg}(x_r)}$ stays below $r$, AEC learns to allocate computation adaptively without task-specific tuning. Across model scales from 1.5B to 32B and on math datasets such as AIME, AMC, and MATH500, AEC achieves a 2–3× reduction in chain-of-thought length while maintaining or improving accuracy, and enables linear, interpretable control over relative accuracy or token usage via post-training calibration. The approach generalizes to out-of-domain tasks, supports inference-time adjustment of effort, and offers a simple RL-based modification to existing training pipelines with potential extensions to multi-agent settings.
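As a rough illustration of the gating described above, the following Python sketch computes a thresholded reward for a group of sampled responses to one prompt. It is not the paper's implementation. Several details are assumptions made only for this example: the function names `aec_reward` and `group_rewards` are hypothetical, $T_{avg}(x_r)$ is estimated as the mean length of the sampled group, the base reward is a binary correctness signal, and the threshold is treated as inclusive.

```python
from statistics import mean


def aec_reward(is_correct: bool, length: int, avg_length: float, r: float) -> float:
    """Thresholded reward sketch: 1.0 only if the response is correct AND its
    relative length, length / avg_length, does not exceed the effort fraction r.

    Assumptions (not confirmed by the source): binary correctness reward,
    inclusive threshold (<=).
    """
    if avg_length <= 0:
        raise ValueError("avg_length must be positive")
    within_budget = (length / avg_length) <= r
    return 1.0 if (is_correct and within_budget) else 0.0


def group_rewards(lengths: list[int], correct: list[bool], r: float) -> list[float]:
    """Score a group of responses sampled for the same prompt.

    Assumption: the 'current average chain-of-thought length' is approximated
    by the mean length over this sampled group.
    """
    if len(lengths) != len(correct):
        raise ValueError("lengths and correct must have the same size")
    avg_length = mean(lengths)
    return [aec_reward(c, n, avg_length, r) for n, c in zip(lengths, correct)]


if __name__ == "__main__":
    # Three sampled responses with chain-of-thought lengths in tokens.
    lengths = [100, 200, 300]
    correct = [True, True, False]
    print(group_rewards(lengths, correct, r=0.5))
    print(group_rewards(lengths, correct, r=1.0))
```

Because the budget is defined relative to the running average rather than as an absolute token count, the same value of $r$ corresponds to a larger absolute budget on prompts where responses are long and a smaller one where they are short.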
Abstract
Increasing the thinking budget of AI models can significantly improve accuracy, but not all questions warrant the same amount of reasoning. Users may prefer to allocate different amounts of reasoning effort depending on how they value output quality versus latency and cost. To leverage this tradeoff effectively, users need fine-grained control over the amount of thinking used for a particular query, but few approaches enable such control. Existing methods require users to specify an absolute number of desired tokens, which presupposes knowing the difficulty of the problem beforehand in order to set an appropriate token budget for a query. To address these issues, we propose Adaptive Effort Control, a self-adaptive reinforcement learning method that trains models to use a user-specified fraction of tokens relative to the current average chain-of-thought length for each query. This approach eliminates dataset- and phase-specific tuning while producing better cost-accuracy tradeoff curves compared to standard methods. Users can dynamically adjust the cost-accuracy tradeoff through a continuous effort parameter specified at inference time. We observe that the model automatically learns to allocate resources proportionally to the task difficulty and, across model scales ranging from 1.5B to 32B parameters, our approach enables a 2–3× reduction in chain-of-thought length while maintaining or improving performance relative to the base model used for RL training.
