HIVTP: A Training-Free Method to Improve VLMs Efficiency via Hierarchical Visual Token Pruning Using Middle-Layer-Based Importance Score

Jingqi Xu, Jingxi Lu, Chenghao Li, Sreetama Sarkar, Peter A. Beerel

TL;DR

HIVTP addresses the inference cost that vision-language models (VLMs) incur from the large number of visual tokens. It combines a training-free importance score, computed from middle-layer attention in the vision encoder, with a two-stage hierarchical pruning scheme. It retains globally important tokens via region-wise top-k selection and locally important tokens via window-based selection, before projecting them to the language space. Across LLaVA-v1.5-7B and LLaVA-Next-7B on eight benchmarks, HIVTP reduces time-to-first-token (TTFT) by up to 50.0% and 55.1%, respectively, and raises token generation throughput by up to 60.9% and 47.3%, with accuracy preserved and in some cases improved. The approach outperforms prior training-free methods by maintaining fine-grained and object-level visual information through middle-layer cues and spatially aware retention.

Abstract

Vision-Language Models (VLMs) have shown strong capabilities on diverse multimodal tasks. However, the large number of visual tokens output by the vision encoder severely hinders inference efficiency, and prior studies have shown that many of these tokens are unimportant and can therefore be safely pruned. In this work, we propose HIVTP, a training-free method to improve VLM efficiency via hierarchical visual token pruning using a novel middle-layer-based importance score. We estimate visual token importance from attention maps extracted from the middle layers of the vision encoder, which better reflect fine-grained and object-level attention. Based on this score, we propose a hierarchical visual token pruning method that retains both globally and locally important visual tokens. We first reshape the 1-D visual token sequence output by the vision encoder into a 2-D spatial layout. In the global retaining stage, we divide the image into regions and retain the tokens with higher importance scores in each region; in the local retaining stage, we divide the image into small windows and retain the most important token in each window. Experimental results show that HIVTP reduces the time-to-first-token (TTFT) of LLaVA-v1.5-7B and LLaVA-Next-7B by up to 50.0% and 55.1%, respectively, and improves token generation throughput by up to 60.9% and 47.3%, without sacrificing accuracy and with improvements on certain benchmarks. Compared with prior works, HIVTP achieves better accuracy while offering higher inference efficiency.
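
The two retaining stages described above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation. The function name `hierarchical_prune`, the square region and window shapes, the parameter names, and the choice to take the union of the two stages are all assumptions made for illustration; the abstract does not specify how regions are laid out or how the two sets are merged.

```python
def hierarchical_prune(scores, grid, region, k, window):
    """Return sorted flat indices of the tokens kept by both stages.

    scores: per-token importance scores in row-major order, length grid * grid
    grid:   side length of the 2-D token layout
    region: side length of each square region (global stage, assumed square)
    k:      number of tokens kept per region (global stage)
    window: side length of each square window (local stage, assumed square)
    """
    assert len(scores) == grid * grid
    assert grid % region == 0 and grid % window == 0

    def block_indices(top, left, size):
        # Flat indices of the size x size block whose corner is (top, left).
        return [r * grid + c
                for r in range(top, top + size)
                for c in range(left, left + size)]

    kept = set()

    # Global retaining stage: top-k tokens by score inside every region.
    for top in range(0, grid, region):
        for left in range(0, grid, region):
            idx = block_indices(top, left, region)
            idx.sort(key=lambda i: scores[i], reverse=True)
            kept.update(idx[:k])

    # Local retaining stage: the single best token inside every window.
    # Assumption: the final set is the union of both stages.
    for top in range(0, grid, window):
        for left in range(0, grid, window):
            idx = block_indices(top, left, window)
            kept.add(max(idx, key=lambda i: scores[i]))

    return sorted(kept)


# Toy 4 x 4 layout where the score of token i is simply i.
print(hierarchical_prune(list(range(16)), grid=4, region=4, k=2, window=2))
# [5, 7, 13, 14, 15]
```

In the toy call, the single region contributes tokens 14 and 15, and the four 2 x 2 windows contribute tokens 5, 7, 13, and 15, so five tokens survive. Because every window keeps one token, no large contiguous area is pruned entirely, which is the spatial coverage that the local stage is meant to provide.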

Paper Structure

This paper contains 19 sections, 12 equations, 13 figures, 3 tables, 2 algorithms.

Figures (13)

  • Figure 1: Framework of HIVTP. a) The overall workflow of HIVTP for VLMs. b) First, we leverage the attention maps from the middle layers of the vision encoder to compute the importance scores of visual tokens. Then, we adopt a hierarchical visual token pruning strategy to retain both globally important and locally important visual tokens.
  • Figure 2: Attention heatmaps at different layers of the vision encoder in LLaVA-v1.5-7B for one example from the MME benchmark. From layer 7 to layer 10, the visual tokens corresponding to the main objects in the image exhibit higher brightness, indicating higher attention weights.
  • Figure 3: Comparison of HIVTP with and without the local retaining stage on a POPE example. HIVTP ($k=50$, w/o local retaining) denotes the variant without the local retaining stage, which hallucinates a car in the image. In contrast, HIVTP ($k=25$, $c=2$), which applies a window size of $2 \times 2$ in the local retaining stage, avoids hallucination. Green areas indicate the retained globally important visual tokens, while red areas indicate the retained locally important visual tokens. The region marked by the purple ellipse highlights a large contiguous area of pruned tokens.
  • Figure 4: Attention heatmaps at different layers of the vision encoder in LLaVA-v1.5-7B for one example from the MME benchmark.
  • Figure 5: Attention heatmaps at different layers of the vision encoder in LLaVA-v1.5-7B for one example from the MME benchmark.
  • ...and 8 more figures
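
The middle-layer importance score referred to in Figures 1 and 2 might be sketched as follows. This is a hedged illustration, not the paper's definition: the captions only say that attention maps from middle layers are used, so the use of [CLS]-to-patch attention, the plain average over heads and layers, and the name `middle_layer_scores` are assumptions. The layer range 7 to 10 in the example is taken from the Figure 2 caption.

```python
def middle_layer_scores(cls_attn, layers):
    """Average [CLS]-to-patch attention over heads and selected layers.

    cls_attn: dict mapping layer index -> list of heads, where each head is
              a list of attention weights from the [CLS] query to every
              patch token (hypothetical input format)
    layers:   layer indices treated as "middle" layers (an assumption)
    """
    n_tokens = len(cls_attn[layers[0]][0])
    totals = [0.0] * n_tokens
    n_maps = 0
    for layer in layers:
        for head in cls_attn[layer]:
            for j, weight in enumerate(head):
                totals[j] += weight
            n_maps += 1
    # One score per patch token; higher means more important.
    return [t / n_maps for t in totals]


# Toy input: two patch tokens, one head per layer, layers 7 to 10.
toy = {7: [[0.3, 0.7]], 8: [[0.1, 0.9]], 9: [[0.2, 0.8]], 10: [[0.2, 0.8]]}
scores = middle_layer_scores(toy, layers=[7, 8, 9, 10])
```

The resulting list has one score per visual token in the same order as the encoder output, so it can be reshaped into the 2-D layout used by the pruning stages.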