PLPHP: Per-Layer Per-Head Vision Token Pruning for Efficient Large Vision-Language Models

Yu Meng, Kaiyuan Li, Chenran Huang, Chen Gao, Xinlei Chen, Yong Li, Xiaoping Zhang

TL;DR

This work tackles the computational bottlenecks of inference in large vision-language models caused by lengthy vision token sequences. It introduces PLPHP, a plug-and-play, two-level pruning framework that first allocates layer-specific vision-token retention via Layer-Level Retention Rate Allocation and then performs per-head Vision Token Pruning within selected decoder layers. By exploiting Vision Token Re-attention patterns and per-head specialization, PLPHP achieves about 18% decoding speedup and over 50% KV cache reduction with only around 0.46% average performance loss, while delivering improvements on multi-image tasks and generalizing across multiple LVLM backbones. The approach is efficient, training-free, and practical for scaling LVLMs to more complex multimodal tasks in real-world settings, with public code to follow.

Abstract

Large Vision-Language Models (LVLMs) have demonstrated remarkable capabilities across a range of multimodal tasks. However, their inference efficiency is constrained by the large number of vision tokens processed during decoding. To address this challenge, we propose Per-Layer Per-Head Vision Token Pruning (PLPHP), a two-level fine-grained pruning method comprising Layer-Level Retention Rate Allocation and Head-Level Vision Token Pruning. Motivated by the Vision Token Re-attention phenomenon across decoder layers, we dynamically adjust token retention rates layer by layer. Layers that exhibit stronger attention to visual information preserve more vision tokens, while layers with lower vision attention are aggressively pruned. Furthermore, PLPHP applies pruning at the attention head level, enabling different heads within the same layer to independently retain critical context. Experiments on multiple benchmarks demonstrate that PLPHP speeds up decoding by 18% and reduces the Key-Value Cache (KV Cache) size by over 50%, at the cost of a 0.46% average performance drop, while also achieving notable performance improvements in multi-image tasks. These results highlight the effectiveness of fine-grained token pruning and contribute to advancing the efficiency and scalability of LVLMs. Our source code will be made publicly available.
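
The layer-level step described in the abstract can be illustrated with a minimal sketch. It assumes a layer's "vision attention" is the mean, over heads, of the attention mass placed on vision tokens, and that a base rate `r` is raised or lowered by `delta_r` depending on two thresholds. The function name, the threshold names `alpha` and `beta`, their default values, and the exact mapping to `r + delta_r`, `r`, and `r - delta_r` are illustrative assumptions, not details taken from this summary.

```python
from typing import List


def allocate_retention_rate(
    vision_attention_per_head: List[float],
    r: float = 0.4,          # base retention rate (assumed default)
    delta_r: float = 0.3,    # adjustment step (assumed default)
    alpha: float = 0.25,     # hypothetical threshold: at or above is "attentive"
    beta: float = 0.10,      # hypothetical threshold: at or below is "indifferent"
) -> float:
    """Pick a vision-token retention rate for one decoder layer.

    vision_attention_per_head holds, for each head, the fraction of
    attention mass (in [0, 1]) that the head places on vision tokens.
    """
    if not vision_attention_per_head:
        raise ValueError("need at least one head")

    # Layer-level score: average vision attention across heads.
    score = sum(vision_attention_per_head) / len(vision_attention_per_head)

    if score >= alpha:
        rate = r + delta_r   # layer attends strongly to the image: keep more
    elif score <= beta:
        rate = r - delta_r   # layer mostly ignores the image: prune hard
    else:
        rate = r             # in between: keep the base rate

    # Clamp so the result is always a valid proportion.
    return min(1.0, max(0.0, rate))
```

With these assumed defaults, a layer whose heads put 30% to 40% of their attention on vision tokens would keep about 70% of them, while a layer that puts under 10% on them would keep about 10%.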

Paper Structure

This paper contains 23 sections, 12 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: The phenomenon of Vision Token Re-attention in different LVLMs. Various LVLMs demonstrate the phenomenon of refocusing on images within deep decoder layers. In these layers, the attention scores corresponding to vision tokens increase, as indicated by the red boxes highlighted in the figure.
  • Figure 2: The proportion of attention scores received by different parts of the same image varies across different decoder layers. Each polyline in the figure represents the proportion of attention scores for the corresponding group of tokens across different decoder layers.
  • Figure 3: Visualization of attention maps in various attention heads. Different heads within the same decoder layer exhibit different attention patterns.
  • Figure 4: Overview of PLPHP. PLPHP has a two-level design comprising Layer-Level Retention Rate Allocation (red dashed boxes) and Head-Level Vision Token Pruning (blue dashed boxes). Once a decoder layer finishes prefilling, PLPHP categorizes the layer as vision indifferent, balanced, or attentive, and assigns it a vision token retention rate based on its average attention scores to the vision tokens. Then, according to the allocated retention rate, PLPHP performs fine-grained pruning for each head within the layer: it removes the vision tokens with lower attention scores from the KV cache of each attention head, ensuring that the remaining proportion of vision tokens does not exceed the pre-assigned retention rate.
  • Figure 5: Visualization of vision token retention rates and performance across seven different benchmarks. A point on each polyline represents a certain hyperparameter setting. We record the vision token retention rate and performance of the method under the corresponding setting. For VTW, we evaluated cases with $K=10, 14$ and $20$. For FastV, we assessed the cases of $(K, R)=(2,0.75), (3,0.5)$ and $(3,0.25)$. As for PLPHP, we examined the situations where $(r, \Delta r)=(0.3,0.3), (0.4,0.3)$ and $(0.5,0.3)$.
  • ...and 3 more figures
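
The head-level step described in the Figure 4 caption can be sketched as follows. The sketch assumes the caller already has one attention score per vision token for each head, and that a retention rate has been assigned to the layer. The function names and the choice to rank by a single score per token are illustrative assumptions. The sketch returns the indices to keep rather than editing a real KV cache.

```python
import math
from typing import List


def prune_head(scores: List[float], retention_rate: float) -> List[int]:
    """Return the sorted indices of vision tokens one head keeps.

    scores[i] is the attention score this head assigns to vision token i.
    """
    if not 0.0 <= retention_rate <= 1.0:
        raise ValueError("retention_rate must be in [0, 1]")

    # Round down so the kept proportion never exceeds the assigned rate.
    keep = math.floor(len(scores) * retention_rate)

    # Rank token indices from highest to lowest attention score.
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)

    # Keep the top-scoring tokens, restored to their original order.
    return sorted(ranked[:keep])


def prune_layer(head_scores: List[List[float]], retention_rate: float) -> List[List[int]]:
    """Apply prune_head independently to every head in a layer."""
    return [prune_head(scores, retention_rate) for scores in head_scores]


# Two heads over the same four vision tokens, with half retained per head.
kept = prune_layer(
    [[0.1, 0.5, 0.3, 0.1],
     [0.4, 0.05, 0.05, 0.5]],
    retention_rate=0.5,
)
print(kept)  # [[1, 2], [0, 3]]
```

Because each head ranks tokens on its own scores, the two heads in the example keep disjoint token sets, which is the per-head behavior the caption describes.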