ChainV: Atomic Visual Hints Make Multimodal Reasoning Shorter and Better

Yuan Zhang, Ming Lu, Junwen Pan, Tao Huang, Kuan Cheng, Qi She, Shanghang Zhang

TL;DR

ChainV introduces a training-free multimodal reasoning framework that injects atomic visual hints grounded in answer-aware visual evidence to curb redundant self-reflection. By combining coarse visual patch selection, fine-grained atomic hints (lines, triangles, boxes), and a consistency-based reliability score, ChainV adaptively interrupts slow thinking with grounded cues via a Bernoulli trigger. Across six reasoning benchmarks and multiple models, ChainV achieves consistent accuracy gains (approximately +2.1 to +3.5 percentage points) while reducing inference latency by about 30-34%, and in some cases by larger margins on math-dense tasks. This approach demonstrates that dynamically evolving visual grounding can yield faster, more accurate multimodal reasoning without additional training or architectural changes, with broad implications for efficient deployment of vision-language systems.

Abstract

Recent multimodal reasoning models have demonstrated impressive capabilities across text and vision. However, even leading models exhibit redundant self-reflection when generating lengthy reasoning chains. Although training-free CoT compression methods have emerged in the LLM domain, they rely on static visual references and thus provide limited gains for multimodal reasoning. We therefore propose ChainV, a framework that dynamically integrates visual hints into the reasoning process, making multimodal reasoning shorter and better. Specifically, ChainV first performs a coarse visual patch selection based on the previous reasoning step, then refines it by identifying the most representative atomic visual hint according to the averaged attention intensity. In addition, ChainV introduces a consistency-based evaluation mechanism to assess the reliability of the chosen hint, guiding the model to adaptively adjust its level of self-reflection. Finally, the pixel coordinates of the selected visual hint and its reliability are incorporated into the thinking process through a Bernoulli stochastic process. Experiments indicate that our method significantly improves reasoning accuracy and efficiency, especially on math-intensive benchmarks where visual hints are crucial for multi-step symbolic reasoning. For example, ChainV achieves a $2.3\%$ improvement on MathVista with MiMo-VL-RL, while reducing inference latency by $51.4\%$ and shortening output token length by $24.5\%$.
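
The four stages named in the abstract (coarse patch selection, atomic hint choice by averaged attention, a consistency-based reliability score, and a Bernoulli trigger) can be illustrated with a toy sketch. This is not the authors' implementation: the function names, the flat list of per-patch attention weights, the agreement-ratio reliability proxy, and the injection probability `p_inject` are all assumptions made for illustration, since the summary above does not specify them.

```python
import random
from statistics import mean


def select_patches(attn, top_k):
    """Coarse step (assumed form): keep the top_k patches by attention weight."""
    order = sorted(range(len(attn)), key=lambda i: attn[i], reverse=True)
    return set(order[:top_k])


def pick_hint(candidates, attn, selected):
    """Fine step: choose the candidate whose selected patches have the
    highest averaged attention. Each candidate is a hypothetical dict with
    'kind', 'coords' (pixel coordinates) and 'patches' (patch indices)."""
    def score(hint):
        covered = [attn[i] for i in hint["patches"] if i in selected]
        return mean(covered) if covered else 0.0
    return max(candidates, key=score)


def reliability(picks):
    """Consistency proxy (an assumption, not the paper's formula): the share
    of repeated selections that agree with the most frequent pick."""
    most_common = max(set(picks), key=picks.count)
    return picks.count(most_common) / len(picks)


def maybe_inject(hint, rel, p_inject, rng):
    """Bernoulli trigger: with probability p_inject (hypothetical parameter),
    return a text cue carrying the hint coordinates and its reliability."""
    if rng.random() < p_inject:
        return f"Visual hint: {hint['kind']} at {hint['coords']} (reliability {rel:.2f})"
    return None


# Toy attention over six image patches from the previous reasoning step.
attn = [0.05, 0.40, 0.30, 0.05, 0.15, 0.05]
candidates = [
    {"kind": "line", "coords": (10, 20, 60, 20), "patches": [0, 1]},
    {"kind": "triangle", "coords": (30, 30, 80, 90), "patches": [1, 2]},
    {"kind": "box", "coords": (50, 10, 90, 40), "patches": [2, 4]},
]
selected = select_patches(attn, top_k=3)
hint = pick_hint(candidates, attn, selected)
rel = reliability(["line", "line", "box", "line"])
cue = maybe_inject(hint, rel, p_inject=0.5, rng=random.Random(0))
```

In this sketch the cue is plain text that a caller could append to the running reasoning chain when the trigger fires; how ChainV actually formats the hint and sets the Bernoulli probability is not stated in the summary above.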

Paper Structure

This paper contains 34 sections, 13 equations, 8 figures, 6 tables.

Figures (8)

  • Figure 1: ChainV enables efficient multimodal reasoning. Results demonstrate shorter reasoning chains, lower inference latency, and improved accuracy with ChainV.
  • Figure 2: The pipeline of ChainV. In panel (a), a multimodal reasoning model solves a mathematical problem, during which ChainV is invoked twice. Panels (b)-(d) show the detailed process of ChainV, whose output is a visual hint annotated with coordinates.
  • Figure 3: Accuracy-latency trade-off on multimodal reasoning models. Green and red numbers indicate the reduction in inference time and the accuracy improvement achieved by ChainV compared to the baseline, respectively. Yellow arrows point toward better performance.
  • Figure 4: Comparison of REP metric across six benchmarks, based on the accuracy and the length of output tokens. Higher REP indicates a more favorable trade-off between reasoning accuracy and output efficiency. The evaluated model is MiMo-VL-RL 7B.
  • Figure 5: Visualization of the attention received by the visual assistant. The reasoning case is randomly sampled from MiMo-VL-RL 7B on the MathVista benchmark. Best viewed in color.
  • ...and 3 more figures