ARM: Adaptive Reasoning Model

Siye Wu, Jian Xie, Yikai Zhang, Aili Chen, Kai Zhang, Yu Su, Yanghua Xiao

TL;DR

The paper tackles the overthinking problem: large reasoning models default to long, verbose reasoning regardless of task difficulty. It introduces the Adaptive Reasoning Model (ARM), which selects among four reasoning formats (Direct Answer, Short CoT, Code, Long CoT) and, beyond its default Adaptive Mode, supports two additional modes (Instruction-Guided and Consensus-Guided). ARM is trained in two stages: SFT for format understanding, followed by Ada-GRPO reinforcement learning to promote efficient format selection. It reduces token usage by about 30% on average and by up to about 70%, and trains about 2x faster, while keeping performance comparable to a model that relies solely on Long CoT. The work demonstrates that adaptive format selection can mitigate overthinking and enable efficient, autonomous reasoning across diverse tasks and backbones.

Abstract

While large reasoning models demonstrate strong performance on complex tasks, they lack the ability to adjust reasoning token usage based on task difficulty. This often leads to the "overthinking" problem -- excessive and unnecessary reasoning -- which, although potentially mitigated by human intervention to control the token budget, still fundamentally contradicts the goal of achieving fully autonomous AI. In this work, we propose Adaptive Reasoning Model (ARM), a reasoning model capable of adaptively selecting appropriate reasoning formats based on the task at hand. These formats include three efficient ones -- Direct Answer, Short CoT, and Code -- as well as a more elaborate format, Long CoT. To train ARM, we introduce Ada-GRPO, an adaptation of Group Relative Policy Optimization (GRPO), which addresses the format collapse issue in traditional GRPO. Ada-GRPO enables ARM to achieve high token efficiency, reducing tokens by an average of 30%, and up to 70%, while maintaining performance comparable to the model that relies solely on Long CoT. Furthermore, not only does it improve inference efficiency through reduced token generation, but it also brings a 2x speedup in training. In addition to the default Adaptive Mode, ARM supports two additional reasoning modes: 1) Instruction-Guided Mode, which allows users to explicitly specify the reasoning format via special tokens -- ideal when the appropriate format is known for a batch of tasks. 2) Consensus-Guided Mode, which aggregates the outputs of the three efficient formats and resorts to Long CoT in case of disagreement, prioritizing performance with higher token usage.
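The Consensus-Guided Mode described above can be illustrated with a minimal sketch. It is not the authors' implementation: the `generate` callable, the format labels, and the answer normalization are hypothetical stand-ins, and "disagreement" is assumed here to mean that the three efficient formats do not all produce the same answer.

```python
from typing import Callable, Tuple

# Hypothetical labels for the three efficient formats named in the abstract.
EFFICIENT_FORMATS = ("direct", "short_cot", "code")


def consensus_guided(
    question: str,
    generate: Callable[[str, str], str],
) -> Tuple[str, str]:
    """Return (answer, source) for one question.

    `generate(question, fmt)` is an assumed interface that returns the final
    answer string produced under the given reasoning format.
    """
    # Query each efficient format and normalize answers for comparison.
    answers = [generate(question, fmt).strip().lower() for fmt in EFFICIENT_FORMATS]

    # Assumption: consensus means all three efficient formats agree.
    if len(set(answers)) == 1:
        return answers[0], "consensus"

    # Otherwise fall back to the more elaborate Long CoT format.
    return generate(question, "long_cot").strip().lower(), "long_cot"
```

Under these assumptions, the Long CoT cost is paid only for questions where the cheaper formats disagree, which matches the abstract's description of trading higher token usage for performance.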

Paper Structure

This paper contains 46 sections, 7 equations, 9 figures, 13 tables.

Figures (9)

  • Figure 1: (a) Comparison of reasoning behaviors across different models on easy and hard tasks. The General Model fails on harder tasks without elaborate reasoning. The Reasoning Model applies Long CoT across all tasks, causing the "overthinking" phenomenon. In contrast, our proposed Arm adapts its reasoning formats based on task difficulty, answering easy questions efficiently while adopting Long CoT for hard tasks. (b) Accuracy versus token cost for Qwen2.5 under different training strategies. "SFT", "+GRPO", and "+Ada-GRPO" refer to models trained with SFT, SFT+GRPO, and SFT+Ada-GRPO, respectively. "+Ada-GRPO" consistently outperforms the expected trade-off line between "SFT" and "+GRPO," demonstrating Arm's superior effectiveness-efficiency balance.
  • Figure 2: Format distribution by task difficulty with Qwen2.5-7B. The hatched areas indicate the percentage of correct answers that were generated using the selected reasoning format.
  • Figure 3: Accuracy comparison between Arm’s Adaptive and Instruction-Guided modes. The figure shows average accuracy across evaluation datasets, with Direct Answer applied only to commonsense and symbolic tasks, as it does not appear in mathematical tasks in Adaptive mode.
  • Figure 4: Relative accuracy and token usage of different models compared to their backbone models on CSQA. "L1" denotes L1-Exact [aggarwal2025l1], and "TP" denotes ThinkPrune [hou2025thinkprune]. "$\tau$-Accuracy" and "$\tau$-#Tokens" are reported relative to each model's backbone after RL training.
  • Figure 5: Performance on the training set across different model sizes trained with Ada-GRPO and GRPO. Except for the implementation of the algorithm, all hyperparameters are kept the same.
  • ...and 4 more figures