Beyond Text-Visual Attention: Exploiting Visual Cues for Effective Token Pruning in VLMs

Qizhe Zhang, Aosong Cheng, Ming Lu, Renrui Zhang, Zhiyong Zhuo, Jiajun Cao, Shaobo Guo, Qi She, Shanghang Zhang

TL;DR

This work identifies two fundamental limitations of pruning LVLM visual tokens by text-visual attention: attention shift, a positional bias induced by rotary position embeddings, and attention dispersion. It introduces VisPruner, a training-free pruning method that first selects important visual tokens using [CLS] attention from the visual encoder, then supplements them with diverse tokens obtained by similarity-based duplicate removal, ensuring broader visual coverage. Because pruning happens before the language model, VisPruner substantially reduces FLOPs and latency while preserving performance across image and video benchmarks and multiple VLM architectures, including high-resolution settings. The approach is compatible with fast attention mechanisms and offers a practical path to efficient multimodal inference without additional training.

Abstract

Large vision-language models (LVLMs) generally contain significantly more visual tokens than their textual counterparts, resulting in a considerable computational burden. Recent efforts tackle this issue by pruning visual tokens early within the language model. Most existing works use attention scores between text and visual tokens to assess the importance of visual tokens. However, in this study, we first analyze the text-visual attention in the language model and find that this score is not an ideal indicator for token pruning. Based on this analysis, we propose VisPruner, a plug-and-play method that utilizes visual cues for more effective token pruning in LVLMs. Specifically, we first use visual attention to select a limited number of significant tokens. Then, we remove duplicate tokens from the remaining ones based on their similarity. By retaining diverse tokens alongside the initially selected important tokens, we maximally preserve the visual information of the input image. Experimental results demonstrate that VisPruner sustains strong performance across various VLM architectures and reduction ratios, significantly outperforming existing methods based on text-visual attention. Notably, without any training, VisPruner can reduce the FLOPs of LLaVA-1.5-7B by 91% and inference latency by 75%, while maintaining comparable performance. Our code is available at https://github.com/Theia-4869/VisPruner.
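
The two-stage selection described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the repository linked above for that). The function name `prune_visual_tokens`, its parameters, and the greedy rule used to remove duplicates (repeatedly drop the token whose nearest neighbour is most similar in cosine terms) are assumptions made for illustration; the abstract only states that duplicates are removed based on similarity.

```python
import numpy as np

def prune_visual_tokens(features, cls_attention, num_important, num_diverse):
    """Return sorted indices of the visual tokens to keep (illustrative sketch).

    features:      (N, D) visual token features, assumed to have nonzero norm.
    cls_attention: (N,) attention from [CLS] to each visual token.
    """
    features = np.asarray(features, dtype=float)
    cls_attention = np.asarray(cls_attention, dtype=float)

    # Stage 1: keep the tokens that receive the most [CLS] attention.
    order = np.argsort(-cls_attention, kind="stable")
    important = order[:num_important]

    # Stage 2 (assumed greedy rule): among the remaining tokens, repeatedly
    # drop the one whose nearest neighbour is most similar, until only
    # num_diverse tokens are left. Least-attended tokens are listed first.
    remaining = list(order[num_important:][::-1])
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    while len(remaining) > num_diverse:
        sub = normed[remaining]
        sim = sub @ sub.T                 # pairwise cosine similarity
        np.fill_diagonal(sim, -np.inf)    # ignore self-similarity
        drop = int(np.argmax(sim.max(axis=1)))
        remaining.pop(drop)

    keep = np.concatenate([important, np.array(remaining, dtype=int)])
    return np.sort(keep)
```

Sorting the kept indices preserves the original token order, so the pruned sequence can be passed to the language model without reshuffling positions. The split between `num_important` and `num_diverse` is left to the caller here.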

Paper Structure

This paper contains 28 sections, 6 equations, 9 figures, 8 tables, 1 algorithm.

Figures (9)

  • Figure 1: The illustration of different pruning methods. Correct answer parts are shown in blue, while hallucinations due to pruning are shown in red. Text-visual attention methods like FastV often preserve the lower parts of input images, which can lead to the loss of crucial visual information during early pruning (e.g. the iron in the man's hand). Random pruning removes positional bias but still fails to preserve important visual information, leading to even more hallucinations (e.g. suitcase and stop sign). Our VisPruner, which uses visual cues for pruning, effectively answers the question with a significant token reduction ratio and provides more details (e.g. the type and color of the car, shown in green).
  • Figure 2: Analysis of text-visual attention shift. (a) Selection frequency and received attention of visual tokens. There is a clear positive correlation between selection frequency and the attention received, which is also accompanied by a positional bias. (b) Proportion of visual tokens with top 25% text-visual attention across each quartile position. In the shallower layers of the language model, attention is notably concentrated on visual tokens with larger indices. (c) Performance with only visual tokens from each quartile position. In the shallower layers, retaining visual tokens located in the central region yields higher performance.
  • Figure 3: Analysis of text-visual attention dispersion. (a) Cumulative distribution of different attentions. In [CLS] attention, a few visual tokens absorb the majority of attention, while last attention is relatively dispersed. (b) Density distribution of different attentions. Unlike [CLS] attention, last attention has a clearly high-entropy, low-peak density, both before and after the elimination of position embedding decay. (c) Performance of pruning based on different attentions across various benchmarks. Last attention, with position embedding decay removed, achieves performance gains but still underperforms random pruning on some benchmarks due to the dispersion issue, while pruning based on [CLS] attention consistently takes the lead.
  • Figure 4: Illustration of VisPruner. We begin by selecting a small portion of important tokens with rich information, based on the [CLS] attention from the visual encoder. For the remaining tokens, we progressively remove duplicates based on similarity, ultimately retaining another set of diverse tokens. These two parts complement each other, ensuring that the model maintains comparable performance even after a significant reduction of visual tokens, without relying on any additional training.
  • Figure 5: Ablation study of the core components. Random denotes randomly selecting tokens, Important refers to selecting only tokens with high attention scores, and VisPruner represents the final version of our method, which also includes a diverse set of tokens with low redundancy as a complement.
  • ...and 4 more figures
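
The dispersion contrast in the Figure 3 caption (a few tokens absorbing most [CLS] attention versus a high-entropy, low-peak last attention) can be quantified with two simple statistics. The sketch below is illustrative only: the helper names `attention_entropy` and `top_fraction_mass` are hypothetical, and the paper may measure dispersion differently.

```python
import numpy as np

def attention_entropy(weights):
    """Shannon entropy (in nats) of an attention vector after normalization."""
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()
    nz = p[p > 0]                      # treat 0 * log(0) as 0
    return float(-(nz * np.log(nz)).sum())

def top_fraction_mass(weights, fraction=0.25):
    """Share of total attention held by the top `fraction` of tokens."""
    p = np.asarray(weights, dtype=float)
    p = p / p.sum()
    k = max(1, int(round(fraction * p.size)))
    return float(np.sort(p)[::-1][:k].sum())
```

A concentrated distribution gives low entropy and a top-fraction mass close to 1, whereas a uniform distribution over N tokens gives the maximum entropy log(N) and a top-fraction mass equal to the fraction itself.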