ChainV: Atomic Visual Hints Make Multimodal Reasoning Shorter and Better
Yuan Zhang, Ming Lu, Junwen Pan, Tao Huang, Kuan Cheng, Qi She, Shanghang Zhang
TL;DR
ChainV introduces a training-free multimodal reasoning framework that injects atomic visual hints grounded in answer-aware visual evidence to curb redundant self-reflection. By combining coarse visual patch selection, fine-grained atomic hints (lines, triangles, boxes), and a consistency-based reliability score, ChainV adaptively interrupts slow thinking with grounded cues via a Bernoulli trigger. Across six reasoning benchmarks and multiple models, ChainV achieves consistent accuracy gains (approximately +2.1 to +3.5 percentage points) while reducing inference latency by about 30-34%, and in some cases by larger margins on math-dense tasks. This approach demonstrates that dynamically evolving visual grounding can yield faster, more accurate multimodal reasoning without additional training or architectural changes, with broad implications for efficient deployment of vision-language systems.
Abstract
Recent advances in multimodal reasoning models have demonstrated impressive capabilities across text and vision. However, even leading models exhibit redundant self-reflection when generating lengthy reasoning chains. While training-free CoT compression methods have emerged for LLMs, they rely on static visual references and thus provide limited gains for multimodal reasoning. We therefore propose ChainV, a framework that dynamically integrates visual hints into the reasoning process, making multimodal reasoning shorter and better. Specifically, ChainV first performs a coarse visual patch selection based on the previous reasoning step, then refines it by identifying the most representative atomic visual hint according to the averaged attention intensity. Additionally, ChainV introduces a consistency-based evaluation mechanism to assess the reliability of the chosen hint, guiding the model to adaptively adjust its level of self-reflection. Finally, the pixel coordinates of the selected visual hint and its reliability are incorporated into the thinking process through a Bernoulli stochastic process. Experiments indicate that our method significantly improves reasoning accuracy and efficiency, especially on math-intensive benchmarks where visual hints are crucial for multi-step symbolic reasoning. For example, ChainV achieves a $2.3\%$ improvement on MathVista with MIMO-VL-RL, while reducing inference latency by $51.4\%$ and shortening output token length by $24.5\%$.
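The four stages named above (coarse patch selection, atomic hint selection by averaged attention, a consistency-based reliability score, and a Bernoulli trigger) can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not specify the data layout or the exact reliability formula, so everything below is an assumption made for illustration. In particular, attention is assumed to be a heads-by-patches table, each hint is assumed to be a named list of patch indices (the hypothetical `hints` mapping) rather than pixel coordinates, reliability is assumed to be the share of heads that agree with the chosen hint, and the Bernoulli success probability is assumed to equal that reliability.

```python
import random
from statistics import mean

def average_heads(head_attn):
    """Average per-patch attention over heads (rows = heads, columns = patches)."""
    return [mean(list(col)) for col in zip(*head_attn)]

def coarse_select(attn, top_k):
    """Coarse step: keep the indices of the top_k patches by attention."""
    order = sorted(range(len(attn)), key=lambda i: attn[i], reverse=True)
    return set(order[:top_k])

def best_hint(attn, hints, kept):
    """Fine step: among hints touching a kept patch, pick the highest mean attention."""
    scored = {
        name: mean([attn[i] for i in patches])
        for name, patches in hints.items()
        if kept & set(patches)
    }
    return max(scored, key=scored.get) if scored else None

def reliability(head_attn, hints, kept, chosen):
    """Assumed consistency score: share of heads whose own best hint equals `chosen`."""
    votes = [best_hint(row, hints, kept) == chosen for row in head_attn]
    return sum(votes) / len(votes)

def maybe_inject(chosen, score, rng):
    """Bernoulli trigger: return the hint with probability `score`, otherwise None."""
    return chosen if rng.random() < score else None

# Toy input: 3 attention heads over 4 image patches (made-up numbers).
head_attn = [
    [0.1, 0.6, 0.2, 0.1],
    [0.2, 0.5, 0.2, 0.1],
    [0.7, 0.1, 0.1, 0.1],
]
# Hypothetical atomic hints, each covering some patch indices.
hints = {"line": [0], "box": [1], "triangle": [2, 3]}

avg = average_heads(head_attn)
kept = coarse_select(avg, top_k=2)
chosen = best_hint(avg, hints, kept)
score = reliability(head_attn, hints, kept, chosen)
injected = maybe_inject(chosen, score, random.Random(0))
```

In this toy input the averaged attention favors the "box" hint, and two of the three heads agree with that choice, so the assumed reliability is 2/3. Tying the trigger probability to reliability means a well-supported hint interrupts the reasoning chain more often than a contested one, which is one plausible reading of "adaptively adjust its level of self-reflection".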
