
When Sinks Help or Hurt: Unified Framework for Attention Sink in Large Vision-Language Models

Jiho Choi, Jaemin Kim, Sanghwan Kim, Seunghoon Hong, Jin-Hwi Park

Abstract

Attention sinks are tokens that attract a disproportionate share of attention. While such sinks have been studied in single-modality transformers, their cross-modal impact in Large Vision-Language Models (LVLMs) remains largely unexplored: are they redundant artifacts or essential global priors? This paper first divides visual sinks into two distinct categories: ViT-emerged sinks (V-sinks), which propagate from the vision encoder, and LLM-emerged sinks (L-sinks), which arise within deep LLM layers. Based on this categorization, our analysis reveals a fundamental performance trade-off: while sinks effectively encode global scene-level priors, their dominance can suppress the fine-grained visual evidence required for local perception. Furthermore, we identify specific functional layers where modulating these sinks most significantly impacts downstream performance. To leverage these insights, we propose Layer-wise Sink Gating (LSG), a lightweight, plug-and-play module that dynamically scales the attention contributions of V-sinks and the remaining visual tokens. LSG is trained via standard next-token prediction, requiring no task-specific supervision while keeping the LVLM backbone frozen. Across most layers, LSG yields improvements on representative multimodal benchmarks, effectively balancing global reasoning and precise local evidence.
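To make the gating idea concrete, the sketch below shows a minimal PyTorch-style module that learns one scalar gate per layer for V-sink tokens and another for the remaining visual tokens, and rescales their key vectors before attention. The module name LayerwiseSinkGate, the key-scaling interface, and the (0, 2) gate range are illustrative assumptions that approximate the behavior described above; this is not the paper's actual LSG implementation.

```python
# Minimal sketch of layer-wise sink gating (assumed interface, not the paper's code).
import torch
import torch.nn as nn


class LayerwiseSinkGate(nn.Module):
    """Per-layer scalar gates for V-sink vs. ordinary visual tokens (illustrative)."""

    def __init__(self, num_layers: int):
        super().__init__()
        # Zero-initialized logits give sigmoid(0) * 2 = 1.0, i.e. an identity gate at the start.
        self.sink_logits = nn.Parameter(torch.zeros(num_layers))
        self.rest_logits = nn.Parameter(torch.zeros(num_layers))

    def forward(self, keys, layer_idx, sink_mask, visual_mask):
        # keys:        (batch, seq_len, dim) key states of one attention layer
        # sink_mask:   (batch, seq_len) bool, True at V-sink token positions
        # visual_mask: (batch, seq_len) bool, True at all visual token positions
        g_sink = 2.0 * torch.sigmoid(self.sink_logits[layer_idx])  # gate in (0, 2)
        g_rest = 2.0 * torch.sigmoid(self.rest_logits[layer_idx])
        scale = torch.ones_like(keys[..., :1])                     # (batch, seq_len, 1)
        scale = torch.where(sink_mask.unsqueeze(-1), g_sink, scale)
        scale = torch.where((visual_mask & ~sink_mask).unsqueeze(-1), g_rest, scale)
        # Scaling keys rescales the attention logits these tokens produce for every query,
        # so their share of attention mass is modulated without touching the backbone weights.
        return keys * scale


# Usage sketch: freeze the LVLM backbone and optimize only the gate logits with the
# standard next-token prediction loss.
gate = LayerwiseSinkGate(num_layers=32)
keys = torch.randn(2, 8, 128)
sink_mask = torch.zeros(2, 8, dtype=torch.bool)
sink_mask[:, 2] = True          # pretend token 2 is a V-sink
visual_mask = torch.zeros(2, 8, dtype=torch.bool)
visual_mask[:, :6] = True       # pretend tokens 0-5 are visual tokens
gated_keys = gate(keys, layer_idx=0, sink_mask=sink_mask, visual_mask=visual_mask)
```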

Paper Structure

This paper contains 37 sections, 7 equations, 20 figures, and 10 tables.

Figures (20)

  • Figure 1: Attention sink phenomena across model modalities. (a) In LLMs, certain tokens consistently receive disproportionately high attention scores across layers. (b) In ViTs, a similar pattern appears where background patches accumulate high attention. (c) In LVLMs, visual sink tokens among the vision token sequence arise from two distinct sources: those inherited from the vision encoder and those newly formed within the LLM layers, motivating three questions investigated in this work: (1) where sinks originate, (2) what information sinks encode, and (3) how sinks affect downstream task performance.
  • Figure 2: Layer-wise salience pattern of visual token groups. (a) V-sinks and L-sinks maintain higher $\ell_2$ norms than ordinary visual tokens. (b) They also receive a significantly larger percentage of the attention mass compared to ordinary visual tokens.
  • Figure 3: Activation patterns of visual tokens across the LVLM stack at different depths. Each line represents one token, colored by category: orange = ViT-emerged sinks, green = LLM-emerged sinks, gray = ordinary tokens. Sink dimensions: 650 for CLIP-ViT-L; 1415, 2533 for LLaMA-2-7B. The input image is shown in Figure \ref{fig:image_overlay}.
  • Figure 4: Layer-wise linear probing on CLEVR scene attributes. Each panel probes a different property (count, size, color, shape) from pooled hidden states of V-sinks, L-sinks, and 5 randomly sampled ordinary tokens across the LLM layers of LLaVA-1.5-7B.
  • Figure 5: Layer-wise optimal gate coefficients from the key-gating intervention. (a--c) Best sink gate per 4-layer block, with the accuracy change vs. baseline marked as positive or negative. (d) Stage-2 sweep splitting the Rest group into L-sinks and ordinary tokens.
  • ...and 15 more figures