To Sink or Not to Sink: Visual Information Pathways in Large Vision-Language Models

Jiayun Luo, Wan-Cyuan Fan, Lyuyang Wang, Xiangteng He, Tanzila Rahman, Purang Abolmaesumi, Leonid Sigal

TL;DR

This work investigates ViT attention sinks in large vision-language models (LVLMs), showing that high-norm ViT tokens propagate into the LLM and encode coarse, high-level visual context. The authors introduce two strategies to harness these sinks: a training-free sink-front repositioning and a training-based DIYSink framework with Dual-MLP projections and dynamic token selection (CoT routing or Reweighting MLP). Across multiple ViT backbones and LLMs, these methods yield consistent gains on broad visual reasoning benchmarks, notably improving global reasoning tasks and complex mathematical/code reasoning. The findings offer a deeper understanding of cross-modal attention in LVLMs and provide practical, low-overhead approaches to enhance visual reasoning by leveraging ViT sinks.

Abstract

Large Vision Language Models (LVLMs) have recently emerged as powerful architectures capable of understanding and reasoning over both visual and textual information. These models typically rely on two key components: a Vision Transformer (ViT) and a Large Language Model (LLM). ViT encodes visual content into a sequence of image tokens and serves as the perceptual front-end -- the eyes of the model. In contrast, the LLM interprets these tokens to perform high-level reasoning, generates responses, and functions as the cognitive core -- the brain of the model. However, it remains unclear which visual tokens contribute most significantly to understanding and reasoning, and how effectively these signals are propagated from ViT to the LLM. While most existing works have focused on identifying attention sinks, low-semantic tokens receiving disproportionately high attention, within the LLM, we shift the focus to the vision encoder by identifying a class of high-norm visual tokens from ViT, referred to as ViT attention sinks -- a problem that has been rarely studied but is indeed very important for LVLMs. Our findings show that these ViT sinks encapsulate high-level semantic concepts from images, allowing the LLM to perform more effective understanding and reasoning. Despite their importance, these sink tokens are often overlooked in existing LVLM architectures. To explore their contribution, we present both qualitative and quantitative analyses of the information embedded in these sink tokens. We also propose both training-free and training-based approaches to better leverage how this information is interpreted by the LLM, and to what extent. By explicitly utilizing these tokens, we demonstrate substantial improvements across a range of LVLMs and visual reasoning tasks, highlighting the untapped potential of ViT attention sinks in enhancing visual reasoning.
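
To make the idea of high-norm ViT tokens and the training-free sink-front repositioning more concrete, the following is a minimal sketch in plain Python. It rests on assumptions that are not taken from the paper: the median-ratio rule in `find_sink_indices`, its default `ratio=5.0`, and the reading of "sink-front" as moving sink tokens to the start of the visual token sequence are all illustrative choices, and the function names are hypothetical.

```python
import math
from statistics import median


def token_norms(tokens):
    """Return the L2 norm of each token embedding (a list of floats)."""
    return [math.sqrt(sum(x * x for x in tok)) for tok in tokens]


def find_sink_indices(tokens, ratio=5.0):
    """Flag tokens whose norm exceeds `ratio` times the median norm.

    Assumption: this median-ratio rule is only an illustration; the
    paper's actual sink criterion is not stated in this summary.
    """
    norms = token_norms(tokens)
    cutoff = ratio * median(norms)
    return [i for i, n in enumerate(norms) if n > cutoff]


def sink_front(tokens, sink_idx):
    """Reorder tokens so sink tokens come first.

    Relative order is preserved within the sink group and within the
    non-sink group.
    """
    sink_set = set(sink_idx)
    sinks = [tokens[i] for i in sorted(sink_set)]
    rest = [tok for i, tok in enumerate(tokens) if i not in sink_set]
    return sinks + rest


# Toy visual tokens: the third one has a much larger norm than the rest.
tokens = [[0.1, 0.2], [0.3, -0.1], [9.0, 12.0], [0.2, 0.2], [-0.1, 0.4]]
sink_idx = find_sink_indices(tokens)
reordered = sink_front(tokens, sink_idx)
```

In this toy input only the token `[9.0, 12.0]` is flagged, so it is moved to the front while the remaining tokens keep their original order. In a real LVLM the reordering would be applied to the ViT output before it reaches the projector and the LLM.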

Paper Structure

This paper contains 40 sections, 3 equations, 15 figures, 14 tables.

Figures (15)

  • Figure 1: Illustration of ViT and LLM attention sinks in LLaVA-v1.5-7B. In LVLMs, given an image (A), we find that ViT sinks (B) are partially propagated into the LLM as (C), alongside LLM-emerged sinks (D), together outlining all sinks within the VLM (E).
  • Figure 2: Overview. DIYSink leverages a Dual-MLP Projector to correctly project sink and non-sink tokens, and one of the two token selection modules, CoT-Reweighting or MLP-Reweighting, to dynamically select the best set of tokens for the LLM based on the specific input.
  • Figure 3: Attention to ViT sink tokens and sink dimensions of ViT and LLM sinks in LLaVA-v1.5-7B. (A) compares ViT token norms with the attention assigned during LLM decoding. (B)–(C) show the sink-dimension distributions for LLM-emergent sinks and ViT-propagated sinks.
  • Figure 4: (A) Flow for obtaining the relevance map and word distribution. (B) Relevance maps of sink and non-sink tokens. H12 and H10 denote the foreground and background attention heads, respectively, of the penultimate layer of the LLM used to extract the relevance maps. (C) Word distributions of sink and non-sink tokens. The first and second rows show the word distributions obtained from 300 images in which cat and person, respectively, are the main objects.
  • Figure 5: (A) Task clustering based on GPT-4o-annotated Image Complexity and Query Globalness. (B) Performance analysis of two model variants (i.e., Sink-only and Non-sink-only) for evaluating the influence of ViT sink tokens on different tasks.
  • ...and 10 more figures
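
The Dual-MLP Projector named in the Figure 2 caption can be pictured as two separate projection networks, with each visual token routed to one of them according to whether it is a sink. The sketch below shows only that routing idea. The layer sizes, the ReLU activation, the toy weights, and every function name are hypothetical and are not taken from the paper.

```python
def linear(x, weight, bias):
    """Apply y = W x + b for a single token vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]


def relu(x):
    """Element-wise ReLU (an assumed activation, chosen for simplicity)."""
    return [max(0.0, v) for v in x]


def make_mlp(w1, b1, w2, b2):
    """Build a two-layer MLP as a closure over fixed toy weights."""
    def mlp(x):
        return linear(relu(linear(x, w1, b1)), w2, b2)
    return mlp


def dual_project(tokens, sink_idx, sink_mlp, non_sink_mlp):
    """Project sink tokens with one MLP and all other tokens with another."""
    sink_set = set(sink_idx)
    return [sink_mlp(tok) if i in sink_set else non_sink_mlp(tok)
            for i, tok in enumerate(tokens)]


# Toy weights: the sink branch halves its input, the other is an identity.
identity = [[1.0, 0.0], [0.0, 1.0]]
half = [[0.5, 0.0], [0.0, 0.5]]
zero_bias = [0.0, 0.0]
sink_mlp = make_mlp(identity, zero_bias, half, zero_bias)
non_sink_mlp = make_mlp(identity, zero_bias, identity, zero_bias)

tokens = [[1.0, 2.0], [30.0, 40.0], [0.5, 0.5]]
projected = dual_project(tokens, [1], sink_mlp, non_sink_mlp)
```

Keeping the two branches separate lets the high-norm sink token be mapped differently from ordinary patch tokens, which is the motivation the caption gives for projecting the two groups with different MLPs. In a trained model both branches would be learned rather than fixed as they are here.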