ARM2: Adaptive Reasoning Model with Vision Understanding and Executable Code

Jian Xie, Zhendong Chu, Aoxiao Zhong, Kai Zhang, Mingzhe Han, Xing Fan, Jialie Shen, Qingsong Wen

TL;DR

ARM2 addresses the overthinking problem in large reasoning models with an adaptive, multimodal framework that selects among five reasoning formats, including executable code, to balance accuracy against token cost. It combines multimodal data construction and supervised fine-tuning with a length-aware reinforcement learning objective (GRPO-alp) that encourages format diversity and penalizes length, letting the model trade reasoning depth against cost. Empirically, ARM2 reduces token usage by over 70% on average across in-distribution (ID) and out-of-distribution (OOD) tasks while maintaining competitive performance, and code execution improves both accuracy and efficiency. The results support the practical viability of adaptive reasoning across tasks and modalities, and highlight the benefits of explicit length-awareness and executable-code integration for efficient reasoning.

Abstract

Large Reasoning Models (LRMs) often suffer from the "overthinking" problem, generating unnecessarily long reasoning on simple tasks. Strategies such as length penalties and routing mechanisms have been proposed to mitigate this issue, but they are typically heuristic and task-specific, and do not provide a general framework for adaptive reasoning. In this paper, we present ARM2, a unified model that adaptively balances reasoning performance and efficiency across multiple formats through a reinforcement learning framework augmented with length-aware optimization. Beyond conventional natural language inference, ARM2 integrates vision understanding, extending its applicability to multimodal tasks. Moreover, ARM2 integrates executable code into reasoning, enabling substantial reductions in token cost while preserving task performance compared to long CoT. Experiments demonstrate that ARM2 achieves performance on par with traditional reasoning models trained with GRPO, while reducing token usage by over 70% on average. We further conduct extensive analyses to validate the effectiveness of ARM2 and the soundness of its design.
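
To make the idea of length-aware optimization concrete, the following is a minimal sketch of how an accuracy reward could be combined with a length penalty and a format-diversity bonus before group-relative normalization. It is not the paper's GRPO-alp objective, whose exact form is not given here. The names `Rollout`, `shaped_rewards`, and `group_advantages`, the format labels, and the weights `length_weight` and `format_weight` are all assumptions chosen for illustration. Only the final step, standardizing rewards within a sampled group, follows the usual GRPO recipe.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class Rollout:
    fmt: str         # illustrative format label, e.g. "long_cot" or "code"
    num_tokens: int  # length of the sampled response in tokens
    correct: bool    # whether the final answer matched the reference


def shaped_rewards(group, length_weight=0.5, format_weight=0.2):
    """Hypothetical shaping: accuracy, scaled down by length, plus a rarity bonus."""
    max_len = max(r.num_tokens for r in group)
    counts = Counter(r.fmt for r in group)
    rewards = []
    for r in group:
        if not r.correct:
            # Wrong answers earn nothing; with length_weight below 1,
            # a short wrong answer can never outrank a correct one.
            rewards.append(0.0)
            continue
        # Longer correct answers keep less of the accuracy reward.
        length_term = 1.0 - length_weight * r.num_tokens / max_len
        # Formats that are rare within the group earn a small bonus.
        rarity_bonus = format_weight * (1.0 - counts[r.fmt] / len(group))
        rewards.append(length_term + rarity_bonus)
    return rewards


def group_advantages(rewards, eps=1e-8):
    """GRPO-style advantages: standardize rewards within the sampled group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three sampled responses to the same prompt (made-up numbers).
group = [
    Rollout("long_cot", 1000, True),
    Rollout("code", 200, True),
    Rollout("long_cot", 800, False),
]
rewards = shaped_rewards(group)
advantages = group_advantages(rewards)
```

Under these assumed weights, the short correct code response receives the highest advantage, the long correct response a smaller one, and the incorrect response the lowest, which is the qualitative behavior a length-aware objective is meant to produce.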

Paper Structure

This paper contains 34 sections, 7 equations, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Comparison of reasoning behaviors across models. Unlike the SFT and GRPO models, which consistently rely on a single reasoning format, ARM2 exhibits adaptability by selecting reasoning formats.
  • Figure 2: Ablation study of ARM2 across 12 datasets. "w/o FE" denotes the removal of format encouragement rewards while retaining length penalty rewards. "w/o LP" denotes the removal of length penalty rewards while retaining format encouragement rewards. "w/o EC" denotes disabling the code interpreter during both training and inference, forcing the model to reason solely in the code-text format.
  • Figure 3: Length distribution of different models across three representative datasets. The dashed vertical line indicates the average token cost for each dataset.
  • Figure 4: ARM2 vs GRPO with varying token budgets.
  • Figure 5: Performance of ARM2 under different length penalty strengths.
  • ...and 2 more figures