Do MLLMs Really See It: Reinforcing Visual Attention in Multimodal LLMs

Siqu Ou, Tianrui Wan, Zhiyuan Zhao, Junyu Gao, Xuelong Li

TL;DR

MLLMs struggle to maintain accurate visual grounding, with early attention errors propagating through long reasoning. SAYO introduces an entropy-aware, region-level visual attention reward within a GRPO-based RL framework to explicitly train attention to task-relevant image regions. Data construction maps bounding boxes to visual tokens, enabling precise region rewards without external prompts. Across diverse benchmarks, SAYO achieves consistent gains in reasoning and perception tasks, with improved visual grounding and robustness to domain shifts, highlighting the importance of targeted visual attention learning in multimodal reasoning.

Abstract

While chain-of-thought (CoT) reasoning has substantially improved multimodal large language models (MLLMs) on complex reasoning tasks, existing approaches largely rely on long textual reasoning trajectories and provide limited mechanisms for learning stable visual attention policies. Our analysis shows that current MLLMs exhibit weak visual focus: early-stage visual misalignment is rarely corrected during subsequent reasoning, leading to error propagation and failed inferences. We argue that this limitation stems from inadequate credit assignment for visual attention during training. To address this issue, we propose SAYO, a visual reasoning model trained with a reinforcement learning (RL) framework that introduces a region-level visual attention-based reward. This reward explicitly aligns optimization signals with visually grounded reasoning steps, enabling the model to learn more reliable attention behaviors. Extensive experiments across multiple multimodal benchmarks demonstrate that SAYO consistently improves performance on diverse reasoning and perception tasks.
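
To make the region-level idea concrete, here is a minimal sketch of how a bounding box might be mapped to visual-token indices and how attention to that region could be scored. It is an illustration under stated assumptions, not the authors' implementation: the function names, the 28-pixel cell size, the row-major token layout, and the "share of attention mass" score are all hypothetical, and the paper's actual target attention score and reward may be defined differently.

```python
import math


def bbox_to_token_indices(bbox, image_size, patch=28):
    """Return row-major indices of grid cells that overlap a pixel bounding box.

    Assumptions (hypothetical, for illustration only):
    - bbox is (x0, y0, x1, y1) in pixels, with x1/y1 exclusive;
    - each visual token covers one patch x patch pixel cell;
    - tokens are laid out row by row over the image.
    """
    x0, y0, x1, y1 = bbox
    width, height = image_size
    cols = math.ceil(width / patch)
    rows = math.ceil(height / patch)
    # Clamp the covered cell range to the grid.
    c0 = max(0, int(x0 // patch))
    r0 = max(0, int(y0 // patch))
    c1 = min(cols - 1, int((x1 - 1) // patch))
    r1 = min(rows - 1, int((y1 - 1) // patch))
    return [r * cols + c for r in range(r0, r1 + 1) for c in range(c0, c1 + 1)]


def region_attention_share(attn, target_indices):
    """Fraction of total visual attention mass that falls on the target tokens.

    attn holds one non-negative weight per visual token. This share is a
    stand-in for a region-level attention score, not the paper's definition.
    """
    total = sum(attn)
    if total == 0:
        return 0.0
    return sum(attn[i] for i in target_indices) / total


# A 112x56 image gives a 4x2 grid; the box covers the two middle cells of row 0.
indices = bbox_to_token_indices((28, 0, 84, 28), (112, 56))
share = region_attention_share([1, 3, 2, 0, 1, 1, 0, 0], indices)
```

In this toy case `indices` is `[1, 2]` and `share` is `0.625`: five of the eight units of attention land inside the box. A reward built on such a quantity would favor rollouts whose attention concentrates on the annotated region.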

Paper Structure

This paper contains 28 sections, 7 equations, 12 figures, 6 tables.

Figures (12)

  • Figure 1: In CoT reasoning, initial errors in visual focus can mislead the subsequent inference. We emphasize enhancing the model's visual capabilities so that MLLMs proactively focus on text-relevant visual regions. The image resolution is 2752x1824. The attention map displays the average attention of all generated tokens.
  • Figure 2: Comparison of target attention score (TAS) and accuracy across models on a subset of the GQA dataset. The displayed score and accuracy are averages over all samples. * denotes models based on the Qwen2.5-7B series.
  • Figure 3: The workflow of our method, which consists of constructing reasoning data with region-level visual information and training with a region attention reward.
  • Figure 4: Attention weights of model-generated tokens to target visual tokens (last layer) and attention weights of tokens with different entropy values to target visual tokens. The entropy values shown have been normalized across samples, and the displayed attention weights represent the average across all samples.
  • Figure 5: An example demonstrating how areas of visual attention shift during the reasoning process. The background color of tokens in the figure indicates the magnitude of the target visual attention score.
  • ...and 7 more figures