See What You Are Told: Visual Attention Sink in Large Multimodal Models

Seil Kang, Jinyeong Kim, Junhyeok Kim, Seong Jae Hwang

TL;DR

This work identifies a visual attention sink in large multimodal models, where irrelevant visual tokens attract high attention due to sink-dimension activations inherited from base language models. It introduces Visual Attention Redistribution (VAR), a two-step, training-free method that selects image-centric attention heads and reallocates surplus attention from sink tokens to visual non-sink tokens, strengthening image grounding. Across diverse LMMs and vision-language benchmarks, VAR yields consistent performance gains on general, hallucination, and vision-centric tasks, without extra training or inference steps. The approach also integrates with existing techniques like VCD, offering a practical route to enhance multimodal capabilities by steering internal attention dynamics toward the image content.

Abstract

Large multimodal models (LMMs) "see" images by leveraging the attention mechanism between text and visual tokens in the transformer decoder. Ideally, these models should focus on key visual information relevant to the text token. However, recent findings indicate that LMMs have an extraordinary tendency to consistently allocate high attention weights to specific visual tokens, even when these tokens are irrelevant to the corresponding text. In this study, we investigate the property behind the appearance of these irrelevant visual tokens and examine their characteristics. Our findings show that this behavior arises due to the massive activation of certain hidden state dimensions, which resembles the attention sink found in language models. Hence, we refer to this phenomenon as the visual attention sink. In particular, our analysis reveals that removing the irrelevant visual sink tokens does not impact model performance, even though these tokens receive high attention weights. Consequently, we recycle the attention to these tokens as surplus resources, redistributing the attention budget to enhance focus on the image. To achieve this, we introduce Visual Attention Redistribution (VAR), a method that redistributes attention in image-centric heads, which we identify as innately focusing on visual information. VAR can be seamlessly applied across different LMMs to improve performance on a wide range of tasks, including general vision-language tasks, visual hallucination tasks, and vision-centric tasks, all without the need for additional training, models, or inference steps. Experimental results demonstrate that VAR enables LMMs to process visual information more effectively by adjusting their internal attention mechanisms, offering a new direction for enhancing the multimodal capabilities of LMMs.

Paper Structure

This paper contains 29 sections, 6 equations, 15 figures, 7 tables.

Figures (15)

  • Figure 1: Visual attention maps of LLaVA-1.5-7B between specified text tokens and visual tokens. The attention map visualizes where the model "sees" when processing the text token. The model is expected to focus only on the visual tokens related to each text token. However, the model also attends to irrelevant visual tokens (red boxes) that are unrelated to the corresponding text token. Although we visualize the attention maps only for a few specified text tokens, these irrelevant tokens consistently occur in fixed locations across all text tokens, including the instructions and the generated responses (see the Appendix for more examples).
  • Figure 2: Illustration of the typical architecture of LMMs and our investigation of the visual attention sink. A large multimodal model receives the image and text as inputs. Each text token interacts with the visual tokens through the attention mechanism in the transformer decoder. We can visualize the interaction in the form of an attention map. We discover that irrelevant visual tokens (marked as red boxes) in the attention map have massive activation in specific dimensions of hidden states, while relevant visual tokens (marked as blue boxes) do not. Well-known sink tokens (e.g., 'BOS') in language models also exhibit identical patterns in the hidden states.
  • Figure 3: Analysis of visual sink tokens. (a) Scatter plot of sink dimension values and attention weights of visual tokens. (b) Performance comparison between masking visual sink tokens and masking the same number of random visual tokens. Dashed line indicates the performance of the original model. (c) Average attention contributions of visual sink tokens and random visual tokens. (d) Visual attention maps with and without visual sink tokens, where the visual sink tokens are highlighted in red boxes.
  • Figure 4: Visualization of the attention heads sorted by visual non-sink ratio. We show some attention heads with a high visual non-sink ratio (left) and a low visual non-sink ratio (right). The attention heads with a high visual non-sink ratio tend to focus on the visual tokens relevant to the corresponding text token. In contrast, the attention heads with a low visual non-sink ratio have vague attention patterns. The attention heads with a high visual non-sink ratio are selected as the image-centric heads.
  • Figure 5: Overview of Visual Attention Redistribution (VAR). (a) We select image-centric heads by evaluating visual non-sink ratio; heads with $r^{\ell, h}_i \geq \rho$ are chosen as image-centric heads. (b) VAR redistributes surplus attention weights from sink tokens to visual non-sink tokens. The attention budget $\boldsymbol{\Omega}$ accumulates a portion $p$ of attention from sink tokens. Finally, visual non-sink tokens receive attention from $\boldsymbol{\Omega}$.
  • ...and 10 more figures
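
The two steps described in the Figure 5 caption can be illustrated with a minimal NumPy sketch for a single attention row (one text token, one head). This is not the authors' implementation. It assumes the sink-token indices are already known, that the visual non-sink ratio is the share of visual attention mass that lands on non-sink visual tokens, and that the budget is handed out in proportion to each target token's current weight. The function names and the toy numbers are hypothetical.

```python
import numpy as np


def visual_non_sink_ratio(attn_row, visual_idx, sink_idx):
    """Assumed definition: attention on visual non-sink tokens divided by
    attention on all visual tokens, for one text token in one head."""
    visual_idx = np.asarray(visual_idx, dtype=int)
    sink_idx = np.asarray(sink_idx, dtype=int)
    non_sink = np.setdiff1d(visual_idx, sink_idx)
    total = attn_row[visual_idx].sum()
    return float(attn_row[non_sink].sum() / total) if total > 0 else 0.0


def redistribute(attn_row, visual_idx, sink_idx, p):
    """Move a portion p of the sink tokens' attention (the budget Omega)
    onto visual non-sink tokens; the row sum is left unchanged."""
    out = np.asarray(attn_row, dtype=float).copy()
    visual_idx = np.asarray(visual_idx, dtype=int)
    sink_idx = np.asarray(sink_idx, dtype=int)
    non_sink = np.setdiff1d(visual_idx, sink_idx)
    target_mass = out[non_sink].sum()
    if sink_idx.size == 0 or target_mass <= 0:
        return out  # nothing to take from, or nowhere to put it
    budget = p * out[sink_idx].sum()          # Omega
    out[sink_idx] *= (1.0 - p)                # sinks keep the remainder
    # Assumption: split the budget proportionally to current weights.
    out[non_sink] += budget * out[non_sink] / target_mass
    return out


def var_row(attn_row, visual_idx, sink_idx, rho, p):
    """Apply redistribution only if the head counts as image-centric
    for this token, i.e. its visual non-sink ratio is at least rho."""
    attn_row = np.asarray(attn_row, dtype=float)
    if visual_non_sink_ratio(attn_row, visual_idx, sink_idx) >= rho:
        return redistribute(attn_row, visual_idx, sink_idx, p)
    return attn_row.copy()


# Toy row: token 0 is a text sink (e.g. BOS), tokens 1-3 are visual
# (token 1 is a visual sink), token 4 is ordinary text.
row = np.array([0.5, 0.2, 0.05, 0.15, 0.1])
visual = [1, 2, 3]
sinks = [0, 1]
ratio = visual_non_sink_ratio(row, visual, sinks)
new_row = var_row(row, visual, sinks, rho=0.4, p=0.5)
skipped = var_row(row, visual, sinks, rho=0.9, p=0.5)
```

In the toy row, half of the sink attention is moved onto the two visual non-sink tokens while the ordinary text token is untouched, and a head whose ratio falls below `rho` is left unchanged. In a real model the same operation would be applied to post-softmax attention weights per layer, head, and query token.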