Deeper Thought, Weaker Aim: Understanding and Mitigating Perceptual Impairment during Reasoning in Multimodal Large Language Models

Ruiying Peng, Xueyu Wu, Jing Lei, Lu Hou, Yuanzheng Ma, Xiaohui Li

Abstract

Multimodal large language models (MLLMs) often suffer from perceptual impairments under extended reasoning modes, particularly in visual question answering (VQA) tasks. We identify attention dispersion as the underlying cause: during multi-step reasoning, the model's visual attention becomes scattered and drifts away from question-relevant regions, effectively "losing focus" on the visual input. To better understand this phenomenon, we analyze the attention maps of MLLMs and observe that reasoning prompts significantly reduce attention to regions critical for answering the question. We further find a strong correlation between the model's overall attention on image tokens and the spatial dispersiveness of its attention within the image. Leveraging this insight, we propose a training-free Visual Region-Guided Attention (VRGA) framework that selects visual heads based on an entropy-focus criterion and reweights their attention, effectively guiding the model to focus on question-relevant regions during reasoning. Extensive experiments on vision-language benchmarks demonstrate that our method effectively alleviates perceptual degradation, leading to improvements in visual grounding and reasoning accuracy while providing interpretable insights into how MLLMs process visual information.
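The two quantities the abstract relates, overall attention on image tokens and the spatial dispersiveness of attention within the image, can be illustrated with a small sketch. This is not the authors' implementation: the function name `image_attention_stats`, the use of Shannon entropy as the dispersion measure, and the toy attention rows are all assumptions made for illustration.

```python
import math

def image_attention_stats(attn_row, image_idx):
    """Return (r_img, h_img) for one attention row (one head, one query token).

    r_img: share of the row's attention mass that falls on image tokens.
    h_img: Shannon entropy (in nats) of the attention renormalised over the
           image tokens; higher values mean the attention is more spread out.
    Both definitions are assumptions, not taken from the paper.
    """
    img = [attn_row[i] for i in image_idx]
    total = sum(attn_row)
    img_mass = sum(img)
    r_img = img_mass / total if total > 0 else 0.0
    if img_mass <= 0:
        return r_img, 0.0
    probs = [a / img_mass for a in img]
    h_img = -sum(p * math.log(p) for p in probs if p > 0)
    return r_img, h_img

# Toy rows (hypothetical values): 2 text tokens followed by 4 image tokens.
image_idx = [2, 3, 4, 5]
focused = [0.2, 0.2, 0.54, 0.02, 0.02, 0.02]    # most image attention on one token
scattered = [0.2, 0.2, 0.15, 0.15, 0.15, 0.15]  # image attention spread evenly

r_f, h_f = image_attention_stats(focused, image_idx)
r_s, h_s = image_attention_stats(scattered, image_idx)
print(f"focused:   r_img={r_f:.2f}, h_img={h_f:.3f}")
print(f"scattered: r_img={r_s:.2f}, h_img={h_s:.3f}")
```

Both toy rows place the same total mass on the image, yet the scattered row has the higher entropy. The total image attention alone therefore does not reveal whether a head is focused, which is why a dispersion measure is tracked alongside it.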

Paper Structure (13 sections, 8 equations, 5 figures, 3 tables)

This paper contains 13 sections, 8 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Comparison of Prompting Strategies in VQA Tasks. This figure compares three prompting strategies (Direct, Region-Guided, and Reason) in visual question answering (VQA) tasks. The attention maps show that the Direct mode focuses correctly on question-relevant regions, while the Region-Guided approach further reduces attention to irrelevant areas, enhancing visual grounding. In contrast, the Reason mode disperses attention across the scene and often leads the model to describe question-irrelevant regions, producing incorrect or misleading summaries. In the QA results, green boxes highlight correct answers, whereas red boxes indicate errors and irrelevant reasoning. Overall, these visualizations illustrate how different prompting strategies influence attention allocation and task performance.
  • Figure 2: Layer-wise relevant region attention ratio across models. The vertical axis denotes the Relevant Region Attention Ratio, which measures the degree of attention allocated to question-relevant regions during VQA (as defined in Sec. \ref{sec:RRAR}). The horizontal axis represents the Transformer layer index, where the attention maps are averaged across heads. Green and red curves correspond to correct and wrong predictions, respectively. The solid line shows the mean value, and shaded areas denote ranges. Across all models, correct answers consistently exhibit higher attention to relevant regions, suggesting that focusing on the correct visual evidence is crucial for accurate reasoning.
  • Figure 3: Impact of prompting strategies on visual grounding and performance. The bar chart (left) compares TextVQA accuracy under three prompting strategies: Reason, Direct, and Region-Guided. The line plots show the layer-wise relevant region attention ratio (RRAR, as defined in Sec. \ref{sec:RRAR}) from the question-end token to relevant visual regions, which measures the focus on question-related areas. Shaded regions indicate the interquartile range (Q1--Q3) of RRAR across samples. The Ocean-R1-3B \cite{ming2025ocean} model is built on the Qwen2.5-VL-3B backbone, while MM-Eureka-7B \cite{mmeureka} and ThinkLite-VL-7B \cite{thinklite} are based on the Qwen2.5-VL-7B backbone. A detailed explanation of RRAR is given in Section 3. We observe that the Reason mode, which emphasizes sequential reasoning, tends to disperse visual attention and reduce focus on question-relevant regions, leading to lower accuracy. By contrast, the proposed Region-Guided prompting re-concentrates attention on relevant regions, improving both grounding and task accuracy.
  • Figure 4: Attention mechanism analysis in vision models. Left: Scatter plot showing the relationship between $R_{img}$ and $H_{img}$. Points within the red box correspond to heads that process visual information effectively. A linear fit is performed on the points, and $r$ denotes the Pearson correlation coefficient. Right: Line graph comparing the Relevant Region Attention Ratio (RRAR) for different head selection methods. The red line shows the RRAR for heads selected by our method, which requires both high $R_{img}$ and low EFR ($\frac{H_{img}}{R_{img}}$). The green line shows the RRAR for heads selected solely by $R_{img}$. The blue line indicates the RRAR across layers. Our method consistently identifies heads that focus precisely on relevant visual regions.
  • Figure 5: Overview of the Visual Region-Guided Attention (VRGA) framework. Our method enhances the visual grounding capability of Multimodal Large Language Models (MLLMs) without additional training. (a) Localization of Question-Relevant Regions: By integrating the attention maps from vision-focused heads $\mathcal{H}_v$ and background-biased heads $\mathcal{H}_b$, and conducting attention sink analysis, we obtain a clean, refined attention map that facilitates the localization of tokens related to the question. (b) Attention Reweighting: During the generation phase, attention to tokens within the localized regions is amplified, steering the model to focus on relevant visual content.
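
The captions above name three operations: measuring RRAR, selecting heads with high $R_{img}$ and low EFR ($\frac{H_{img}}{R_{img}}$), and amplifying attention on localized tokens. The sketch below illustrates them under stated assumptions. Only the EFR ratio comes from the captions. The RRAR definition (relevant-token mass divided by image-token mass), the thresholds `min_r_img` and `max_efr`, the gain `alpha`, the renormalisation step, and all head names and numbers are hypothetical. The background-biased heads and attention sink analysis from Figure 5 are not modelled here.

```python
def rrar(attn_row, image_idx, relevant_idx):
    """Share of image-token attention on question-relevant tokens (assumed definition)."""
    img_mass = sum(attn_row[i] for i in image_idx)
    rel_mass = sum(attn_row[i] for i in relevant_idx)
    return rel_mass / img_mass if img_mass > 0 else 0.0

def select_heads(head_stats, min_r_img, max_efr):
    """Keep heads with high R_img and low EFR = H_img / R_img.

    head_stats maps a head id to (r_img, h_img). Both thresholds are
    hypothetical tuning knobs, not values from the paper.
    """
    selected = []
    for head, (r_img, h_img) in head_stats.items():
        if r_img <= 0:
            continue  # EFR is undefined when a head ignores the image
        if r_img >= min_r_img and h_img / r_img <= max_efr:
            selected.append(head)
    return selected

def reweight(attn_row, relevant_idx, alpha):
    """Scale attention on relevant tokens by alpha, then restore the original row sum."""
    relevant = set(relevant_idx)
    boosted = [a * alpha if i in relevant else a for i, a in enumerate(attn_row)]
    boosted_sum = sum(boosted)
    if boosted_sum <= 0:
        return list(attn_row)
    scale = sum(attn_row) / boosted_sum
    return [b * scale for b in boosted]

# Hypothetical per-head statistics: (r_img, h_img).
head_stats = {"L10H3": (0.50, 0.40), "L12H7": (0.45, 1.30), "L20H1": (0.05, 0.02)}
visual_heads = select_heads(head_stats, min_r_img=0.3, max_efr=1.0)

# Toy attention row: 2 text tokens, then 4 image tokens; token 2 is "relevant".
row = [0.2, 0.2, 0.15, 0.15, 0.15, 0.15]
image_idx, relevant_idx = [2, 3, 4, 5], [2]
rrar_before = rrar(row, image_idx, relevant_idx)
new_row = reweight(row, relevant_idx, alpha=3.0)
rrar_after = rrar(new_row, image_idx, relevant_idx)
print(visual_heads, round(rrar_before, 3), round(rrar_after, 3))
```

In this toy setup, head `L20H1` has the lowest EFR but is rejected because it places almost no attention on the image. That mirrors the caption's point that both conditions are needed. Tripling the weight on the relevant token raises the toy RRAR from 0.25 to 0.5 while the row still sums to 1.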