ARES: Multimodal Adaptive Reasoning via Difficulty-Aware Token-Level Entropy Shaping

Shuang Chen, Yue Guo, Yimeng Ye, Shijue Huang, Wenbo Hu, Haoxi Li, Manyuan Zhang, Jiayu Chen, Song Guo, Nanyun Peng

TL;DR

ARES tackles the inefficiency of multimodal large reasoning models by balancing reasoning depth with task difficulty. It introduces a two-stage training pipeline—AdaCS for difficulty-aware cold-start fine-tuning and AEPO for adaptive entropy policy optimization—using high window entropy as a trigger and a hierarchical KL budget to modulate exploration. The approach yields superior accuracy and inference efficiency across a wide suite of multimodal and textual benchmarks, closely approaching leading proprietary systems at lower costs. Together, these contributions establish a principled, scalable method for adaptive reasoning in multimodal LLMs with practical deployment implications.

Abstract

Recent advances in multimodal large reasoning models (MLRMs) have substantially improved their ability to solve complex textual and visual tasks. However, these models tend to overthink on simple problems, producing unnecessarily lengthy reasoning traces, while under-exploring on challenging ones, leading to missed solutions. To address this imbalance, we propose ARES, a unified open-source framework for adaptive reasoning that dynamically allocates exploration effort based on task difficulty. Our approach is motivated by two key empirical findings: (i) while single-token entropy is noisy, high window-entropy (HWE) tokens (token-level entropies averaged under a sliding window) can reliably capture reasoning-critical moments; and (ii) reducing HWE usage benefits easy problems, while increasing it is essential for solving hard ones. Building on these insights, ARES introduces a two-stage training pipeline. In the Adaptive Cold-Start stage, we curate multimodal and textual data paired with reasoning traces of length proportional to problem difficulty, equipping the model with initial difficulty awareness. In the second stage, we develop Adaptive Entropy Policy Optimization (AEPO), which uses HWE tokens as exploration triggers to decide when to explore, and a hierarchical entropy reward with dynamic KL control to decide how much to explore. Extensive experiments demonstrate that ARES achieves superior performance and reasoning efficiency across diverse mathematical, logical, and multimodal benchmarks, while closing the gap to leading commercial systems under significantly lower inference costs.
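The high window-entropy (HWE) idea described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the trailing (rather than centered) window, the nearest-rank percentile threshold, the default values, and the names `token_entropy`, `window_entropy`, and `hwe_positions` are all assumptions made here.

```python
import math


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def window_entropy(entropies, window):
    """Trailing moving average of per-token entropies.

    Position t averages entropies[max(0, t - window + 1) : t + 1].
    The trailing window is an assumption; a centered window is equally plausible.
    """
    smoothed = []
    running = 0.0
    for t, h in enumerate(entropies):
        running += h
        if t >= window:
            running -= entropies[t - window]  # drop the value that left the window
        smoothed.append(running / min(t + 1, window))
    return smoothed


def hwe_positions(entropies, window=4, percentile=80.0):
    """Indices whose window entropy is at or above the given percentile.

    The percentile rule (nearest rank) and the defaults are hypothetical.
    """
    smoothed = window_entropy(entropies, window)
    ranked = sorted(smoothed)
    k = max(0, math.ceil(percentile / 100.0 * len(ranked)) - 1)
    threshold = ranked[k]
    return [t for t, h in enumerate(smoothed) if h >= threshold]
```

For example, with per-token entropies `[0, 0, 0, 5, 5, 0, 0, 0, 0, 0]`, `window=2`, and `percentile=80.0`, the smoothed sequence peaks around the two high-entropy tokens and `hwe_positions` flags positions 3, 4, and 5. Averaging over a window means an isolated spike is damped while a sustained run of uncertain tokens stands out, which is the intuition behind preferring window entropy to single-token entropy.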

Paper Structure

This paper contains 65 sections, 8 theorems, 86 equations, 9 figures, 5 tables, 1 algorithm.

Key Result

Lemma 1

For i.i.d. $R_1,\dots,R_N$ with finite variance, the GRPO group-baseline advantage $A_i = R_i - \bar{R}$ satisfies
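For this setup, a standard identity is that each $A_i$ has mean zero and variance $(1 - 1/N)\sigma^2$, where $\sigma^2$ is the variance of a single reward. Whether this is the exact form stated in Lemma 1 is an assumption. The Python sketch below checks the identity by Monte Carlo with Bernoulli rewards; the function names, the reward distribution, and the parameter values are hypothetical choices for illustration.

```python
import random


def group_advantages(rewards):
    """Group-baseline advantages A_i = R_i - mean(R) for one rollout group."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def empirical_advantage_variance(n, p, trials, seed=0):
    """Monte Carlo estimate of Var(A_1) for n i.i.d. Bernoulli(p) rewards."""
    rng = random.Random(seed)
    samples = []
    for _ in range(trials):
        rewards = [1.0 if rng.random() < p else 0.0 for _ in range(n)]
        samples.append(group_advantages(rewards)[0])
    mean = sum(samples) / trials
    return sum((a - mean) ** 2 for a in samples) / trials


def predicted_advantage_variance(n, sigma2):
    """Closed form (1 - 1/N) * sigma^2 for the group-baseline advantage."""
    return (1.0 - 1.0 / n) * sigma2


if __name__ == "__main__":
    n, p = 8, 0.5
    print("empirical:", empirical_advantage_variance(n, p, trials=20000))
    print("predicted:", predicted_advantage_variance(n, p * (1.0 - p)))
```

The closed form follows from writing $A_i = (1 - 1/N) R_i - \frac{1}{N}\sum_{j \neq i} R_j$ and summing the variances of the independent terms: $(1 - 1/N)^2\sigma^2 + \frac{N-1}{N^2}\sigma^2 = (1 - 1/N)\sigma^2$.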

Figures (9)

  • Figure 1: Accuracy comparison across selected open-source reasoning models on nine multimodal and textual benchmarks. Each group represents 3B-scale and 7B-scale models evaluated under the same benchmarks. The rightmost column (“Avg.”) reports the average accuracy over all selected benchmarks, showing the overall advantage of the proposed adaptive reasoning framework. Our ARES-7B achieves superior performance.
  • Figure 2: (a) F1 Score vs. threshold percentile across different window sizes. Window-based entropy aggregation consistently outperforms single-token entropy, especially at higher thresholds. (b) A word cloud visualization of semantically filtered high-entropy tokens, where font size reflects relative frequency. These tokens (e.g., explain, assume, constraint, conclude) correspond to reasoning triggers that mark the onset of logical transitions, highlighting the interpretable basis of our entropy-based reward.
  • Figure 3: Training dynamics of vanilla GRPO on the cold-start model: (a) average response length, (b) number of high-entropy tokens, and (c) accuracy, all measured across iterations. The trends indicate that the growth in high-entropy tokens is closely aligned with increases in response length and accuracy.
  • Figure 4: Entropy–difficulty interaction in exploration. (a) Conceptual illustration: task difficulty modulates the reasoning trajectory, with easy problems requiring little exploration and hard problems benefiting from deeper branching. (b) Quantitative analysis: (i) for easy tasks, responses below the entropy threshold are both shorter and more accurate; for hard tasks, above-threshold exploration yields higher accuracy; (ii) response length increases significantly with difficulty; (iii) within each difficulty, correct cases use fewer high-entropy tokens for easy problems but more for hard problems; and (iv) correctness further amplifies this trend in response length. Together, these results show that limiting exploration improves efficiency on easy problems, while encouraging additional exploration is crucial for solving difficult ones.
  • Figure 5: Overall training pipeline of our method. Stage 1 (Adaptive Coldstart Fine-Tuning): difficulty-aware selective data curation and adaptive KL-guided fine-tuning establish a strong initialization across text and multimodal inputs. Stage 2 (Adaptive Entropy Policy Optimization, AEPO): online difficulty bucketing and entropy-aware rollout allocate reasoning depth dynamically, with high-entropy windows serving as branching points for exploration. Together, the two stages enable uncertainty-aware, difficulty-adaptive reasoning for large language models.
  • ...and 4 more figures

Theorems & Definitions (14)

  • Lemma 1: Group-baseline advantage variance
  • proof
  • Proposition 1: KL penalty inflates GRPO advantage variance
  • proof
  • Lemma 2: Strong duality and KL-as-budget
  • proof
  • Lemma 3: Pinsker-type bound
  • Theorem 1: Donsker–Varadhan control of moment budgets
  • proof
  • Lemma 4: Weighted Fisher trust region
  • ...and 4 more