Rethinking Visual Token Reduction in LVLMs Under Cross-Modal Misalignment

Rui Xu, Yunke Wang, Yong Luo, Bo Du

TL;DR

This work introduces VisionDrop, a training-free, visual-only pruning framework that selects informative visual tokens based on intra-modal (visual-to-visual) attention, without relying on textual signals. It treats the visual encoder and the LLM as a unified system and applies a progressive pruning pipeline across both.

Abstract

Large Vision-Language Models (LVLMs) encode visual inputs as dense sequences of patch-level tokens to capture fine-grained semantics. These visual tokens often outnumber their textual counterparts by a large margin, leading to substantial computational overhead and limiting the scalability of LVLMs in practice. Previous efforts have explored visual token reduction either prior to or within the large language models (LLMs). However, most in-LLM reduction approaches rely on text-conditioned interactions, implicitly assuming that textual tokens can reliably capture the importance of visual tokens. In this work, we revisit this assumption and reveal causal, semantic, and spatial forms of cross-modal misalignment. These misalignments undermine the effectiveness of text-guided visual token reduction. To address this, we introduce VisionDrop, a training-free, visual-only pruning framework that selects informative visual tokens based on intra-modal (visual-to-visual) attention, without relying on textual signals. To further suppress redundancy throughout the model hierarchy, we treat the visual encoder and the LLM as a unified system and design a progressive pruning pipeline. Our method performs dominant token selection and lightweight contextual merging at multiple stages, enabling fine-grained visual information to be retained even under aggressive token budgets. Extensive experiments across diverse benchmarks show that VisionDrop achieves consistent improvements over existing approaches, despite requiring no additional training or complex modifications. Notably, when integrated with LLaVA-NeXT-7B, VisionDrop achieves a 2.7x reduction in inference latency and 6x in FLOPs, while retaining 95.71% of the original performance.
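The contrast the abstract draws between text-guided and visual-only scoring can be illustrated with a small NumPy sketch. This is a toy illustration under stated assumptions, not the authors' implementation: the function names (`visual_only_scores`, `text_guided_scores`, `select_top_k`), the averaging over heads and queries, and the hand-made attention matrix are all hypothetical, and the toy matrix is not causally masked.

```python
import numpy as np

def visual_only_scores(attn, vis_idx):
    """Score each visual token by the attention it receives from visual queries.

    attn: (heads, seq, seq) attention weights; rows are queries, columns are keys.
    vis_idx: positions of the visual tokens in the sequence.
    """
    block = attn[:, vis_idx][:, :, vis_idx]   # image-to-image block, (heads, V, V)
    return block.mean(axis=(0, 1))            # average over heads and visual queries

def text_guided_scores(attn, vis_idx, txt_idx):
    """Score each visual token by the attention it receives from text queries."""
    block = attn[:, txt_idx][:, :, vis_idx]   # text-to-image block, (heads, T, V)
    return block.mean(axis=(0, 1))

def select_top_k(scores, k):
    """Return the indices (in original order) of the k highest-scoring tokens."""
    keep = np.argsort(-scores, kind="stable")[:k]
    return np.sort(keep)

# Toy sequence: 3 visual tokens (positions 0-2) followed by 2 text tokens (3-4).
vis_idx = np.array([0, 1, 2])
txt_idx = np.array([3, 4])
attn = np.array([[
    [0.2, 0.7, 0.1, 0.0, 0.0],
    [0.1, 0.8, 0.1, 0.0, 0.0],
    [0.1, 0.6, 0.3, 0.0, 0.0],
    [0.1, 0.1, 0.6, 0.1, 0.1],
    [0.1, 0.1, 0.6, 0.1, 0.1],
]])

kept_visual = select_top_k(visual_only_scores(attn, vis_idx), 1)
kept_text = select_top_k(text_guided_scores(attn, vis_idx, txt_idx), 1)
```

In this hand-made example the two criteria keep different tokens: the visual queries concentrate on visual token 1, while the text queries concentrate on visual token 2. That disagreement is the kind of situation in which the choice of scoring signal matters.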

Paper Structure

This paper contains 35 sections, 4 equations, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: Comparison of visual token pruning strategies via attention maps from LLM decoding layers. Here, $x_{V1}, x_{V2}$ are visual tokens, $x_{T1}, x_{T2}$ are text instructions, and $y_1, y_2$ are autoregressively generated outputs. (a) Our method identifies important visual tokens (e.g., the trees highlighted in orange) by leveraging image-to-image attention (blue box), which reflects intra-modal relevance and avoids interference from misaligned cross-modal signals. (b) In contrast, previous approaches rely on image-to-text attention (red box) to assess visual token importance, which can be overly sensitive to cross-modal noise, resulting in preservation of semantically redundant visual regions (e.g., the grass in green).
  • Figure 2: Accuracy comparisons between text-guided and visual-only scoring within the LLM across different average remaining token levels (192, 128, 64) on four benchmarks.
  • Figure 3: Visualization of text-guided visual token retention after shallow-layer pruning (specifically, the second layer) in the LLM. Pale translucent blocks indicate pruned tokens. Retained tokens consistently cluster at the bottom of the image, revealing a positional bias caused by the causal attention in autoregressive LLMs.
  • Figure 4: Visualization of text-guided visual token retention at the end of the first three LLM stages. While the wallet-related region is correctly preserved in response to the first question, the model fails to retain relevant tokens around the umbrella in the second case, reflecting the semantic misalignment caused by modal entanglement in LLM layers.
  • Figure 5: The visual encoder and LLM are partitioned into multiple pruning stages (e.g., Stage 1–N), where the number of visual tokens is progressively reduced. At the end of each stage, the importance of visual tokens is estimated via attention-based scores (values are denoted by color intensity, bottom), computed from self-attention among visual tokens (optionally using the [CLS] token as the query). Low-importance tokens are merged with each other using key-value similarity to form contextual representations, which are then propagated together with informative tokens to the next stage.
  • ...and 2 more figures
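The single pruning stage described in the Figure 5 caption (keep the dominant tokens, merge the low-importance ones into contextual tokens) can be sketched as follows. This is a minimal sketch under several assumptions that the caption does not specify: the function name `prune_stage`, the evenly spaced choice of merge targets, cosine similarity on keys only, and plain averaging of hidden states are all hypothetical choices made for illustration.

```python
import numpy as np

def prune_stage(hidden, keys, scores, n_keep, n_ctx):
    """One hypothetical pruning stage.

    hidden: (V, D) visual token states; keys: (V, Dk) attention keys;
    scores: (V,) importance scores. Returns (n_keep + n_ctx, D) when there
    are at least n_ctx low-importance tokens left to merge.
    """
    order = np.argsort(-scores, kind="stable")
    keep = np.sort(order[:n_keep])            # dominant tokens, original order
    rest = np.sort(order[n_keep:])            # low-importance tokens
    if n_ctx == 0 or rest.size == 0:
        return hidden[keep]

    # Assumption: merge targets are evenly spaced among the low-importance tokens.
    n_ctx = min(n_ctx, rest.size)
    tgt_pos = np.unique(np.linspace(0, rest.size - 1, n_ctx).round().astype(int))

    # Assign every low-importance token to its most similar target (cosine on keys).
    k = keys[rest]
    k = k / (np.linalg.norm(k, axis=1, keepdims=True) + 1e-8)
    assign = (k @ k[tgt_pos].T).argmax(axis=1)
    assign[tgt_pos] = np.arange(tgt_pos.size)  # each target always joins its own group

    # Average each group's hidden states into one contextual token.
    rest_hidden = hidden[rest]
    merged = np.stack([rest_hidden[assign == t].mean(axis=0)
                       for t in range(tgt_pos.size)])
    return np.concatenate([hidden[keep], merged], axis=0)

# Toy input: 6 tokens with 2-dimensional states and keys.
hidden = np.arange(12, dtype=float).reshape(6, 2)
keys = np.array([[1, 0], [1, 0], [0, 1], [1, 0.1], [0.1, 1], [0, 1]], dtype=float)
scores = np.array([0.9, 0.1, 0.8, 0.2, 0.05, 0.3])

out = prune_stage(hidden, keys, scores, n_keep=2, n_ctx=2)
```

Here tokens 0 and 2 are kept as dominant tokens, and the remaining four tokens are merged into two contextual tokens, so six tokens are reduced to four. A progressive pipeline would apply such a stage repeatedly with a shrinking token budget; the schedule of budgets is not given in this section and is left unspecified.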