See What You Are Told: Visual Attention Sink in Large Multimodal Models

Seil Kang; Jinyeong Kim; Junhyeok Kim; Seong Jae Hwang

TL;DR

This work identifies a visual attention sink in large multimodal models, where irrelevant visual tokens attract high attention due to sink-dimension activations inherited from base language models. It introduces Visual Attention Redistribution (VAR), a two-step, training-free method that selects image-centric attention heads and reallocates surplus attention from sink tokens to visual non-sink tokens, strengthening image grounding. Across diverse LMMs and vision-language benchmarks, VAR yields consistent performance gains on general, hallucination, and vision-centric tasks, without extra training or inference steps. The approach also integrates with existing techniques like VCD, offering a practical route to enhance multimodal capabilities by steering internal attention dynamics toward the image content.

Abstract

Large multimodal models (LMMs) "see" images by leveraging the attention mechanism between text and visual tokens in the transformer decoder. Ideally, these models should focus on key visual information relevant to the text token. However, recent findings indicate that LMMs have an extraordinary tendency to consistently allocate high attention weights to specific visual tokens, even when these tokens are irrelevant to the corresponding text. In this study, we investigate the property behind the appearance of these irrelevant visual tokens and examine their characteristics. Our findings show that this behavior arises from the massive activation of certain hidden state dimensions, which resembles the attention sink found in language models. Hence, we refer to this phenomenon as the visual attention sink. In particular, our analysis reveals that removing the irrelevant visual sink tokens does not impact model performance, even though these tokens receive high attention weights. Consequently, we treat the attention assigned to these tokens as a surplus resource and redistribute this attention budget to enhance focus on the image. To achieve this, we introduce Visual Attention Redistribution (VAR), a method that redistributes attention in image-centric heads, which we identify as innately focusing on visual information. VAR can be seamlessly applied across different LMMs to improve performance on a wide range of tasks, including general vision-language tasks, visual hallucination tasks, and vision-centric tasks, all without the need for additional training, models, or inference steps. Experimental results demonstrate that VAR enables LMMs to process visual information more effectively by adjusting their internal attention mechanisms, offering a new direction for enhancing the multimodal capabilities of LMMs.
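The two steps named above (selecting image-centric heads, then moving attention from sink tokens to visual non-sink tokens) can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The function name `redistribute_attention`, the parameters `p` (fraction of sink attention treated as surplus) and `rho` (head-selection threshold), and the selection rule itself (share of a head's visual attention that already lands on non-sink visual tokens) are all assumptions made for illustration. The sink mask is taken as given; how sink tokens are detected from hidden-state activations is not shown.

```python
import numpy as np

def redistribute_attention(attn, visual_mask, sink_mask, p=0.5, rho=0.5):
    """Illustrative sketch only; names, p, and rho are hypothetical.

    attn:        (heads, tokens) attention weights for one query token,
                 each row summing to 1.
    visual_mask: (tokens,) bool, True for visual tokens.
    sink_mask:   (tokens,) bool, True for sink tokens (assumed given).
    """
    attn = np.asarray(attn, dtype=float)
    visual_mask = np.asarray(visual_mask, dtype=bool)
    sink_mask = np.asarray(sink_mask, dtype=bool)
    target = visual_mask & ~sink_mask  # visual non-sink tokens

    visual_total = attn[:, visual_mask].sum(axis=1)
    target_total = attn[:, target].sum(axis=1)

    # Step 1 (assumed rule): a head counts as image-centric when at least
    # rho of its visual attention falls on non-sink visual tokens.
    ratio = np.divide(target_total, visual_total,
                      out=np.zeros_like(target_total),
                      where=visual_total > 0)
    selected = (ratio >= rho) & (target_total > 0)

    # Step 2: in selected heads, take a fraction p of the sink attention
    # and spread it over visual non-sink tokens in proportion to the
    # attention they already receive, so each row still sums to 1.
    out = attn.copy()
    for h in np.flatnonzero(selected):
        budget = p * attn[h, sink_mask].sum()
        out[h, sink_mask] = (1.0 - p) * attn[h, sink_mask]
        out[h, target] += budget * attn[h, target] / target_total[h]
    return out, selected
```

Because the budget is removed from sink tokens and added back to visual non-sink tokens in equal measure, the attention rows remain normalized and no extra forward pass is needed, which matches the training-free, single-pass framing in the abstract.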
