TopV: Compatible Token Pruning with Inference Time Optimization for Fast and Low-Memory Multimodal Vision Language Model

Cheng Yang, Yang Sui, Jinqi Xiao, Lingyi Huang, Yu Gong, Chendi Li, Jinghua Yan, Yu Bai, Ponnuswamy Sadayappan, Xia Hu, Bo Yuan

TL;DR

TopV recasts visual token pruning in Vision-Language Models as a training-free optimization problem that identifies and removes redundant visual tokens during the prefilling stage. It introduces a visual-aware cost function that combines feature similarity, relative spatial distance, and absolute central distance, and solves the resulting problem with the Sinkhorn algorithm to obtain a Contribution Matrix that ranks token importance, while remaining compatible with FlashAttention and KV cache. A token recovery step preserves coverage, and pruning is applied once per input, yielding substantial reductions in visual FLOPs and dynamic memory with modest accuracy impact across multiple VLMs and tasks. The approach achieves inference speedups of up to ~2.1× and dynamic memory savings of ~49–61%, demonstrating practical gains for fast and memory-efficient multimodal inference across models such as LLaVA and InternVL2.
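
To make the Sinkhorn step concrete, here is a minimal NumPy sketch of entropy-regularised optimal transport. It is not the authors' implementation. The function name `sinkhorn`, the uniform marginals, the regularisation strength `epsilon`, the iteration count, and the random cost matrix are all assumptions made for illustration. This summary does not state how TopV sets these values, nor how the resulting matrix is reduced to per-token importance scores, so the sketch stops at the matrix itself.

```python
import numpy as np

def sinkhorn(cost, epsilon=0.5, iterations=100):
    """Entropy-regularised optimal transport between uniform marginals.

    cost[i, j] is the cost of assigning source token i to target token j.
    Returns a non-negative plan whose columns sum to 1/m and whose rows
    sum to 1/n once the iteration has converged.
    """
    n, m = cost.shape
    a = np.full(n, 1.0 / n)           # assumed uniform mass on source tokens
    b = np.full(m, 1.0 / m)           # assumed uniform mass on target tokens
    kernel = np.exp(-cost / epsilon)  # Gibbs kernel
    u, v = np.ones(n), np.ones(m)
    for _ in range(iterations):
        u = a / (kernel @ v)          # rescale rows towards marginal a
        v = b / (kernel.T @ u)        # rescale columns towards marginal b
    return u[:, None] * kernel * v[None, :]

# Hypothetical 6 source x 6 target cost matrix with entries in [0, 1).
rng = np.random.default_rng(0)
cost = rng.random((6, 6))
plan = sinkhorn(cost)
print(plan.sum(axis=0))  # column sums of the plan
print(plan.sum(axis=1))  # row sums of the plan
```

A smaller `epsilon` gives a sharper, more selective plan but converges more slowly and can underflow in `np.exp`, so in practice it has to be balanced against the iteration budget.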

Abstract

Vision-Language Models (VLMs) demand substantial computational resources during inference, largely due to the extensive visual input tokens for representing visual information. Previous studies have noted that visual tokens tend to receive less attention than text tokens, suggesting their lower importance during inference and potential for pruning. However, their methods encounter several challenges: reliance on greedy heuristic criteria for token importance and incompatibility with FlashAttention and KV cache. To address these issues, we introduce TopV, a compatible TOken Pruning with inference Time Optimization for fast and low-memory VLM, achieving efficient pruning without additional training or fine-tuning. Instead of relying on attention scores, we formulate token pruning as an optimization problem, accurately identifying important visual tokens while remaining compatible with FlashAttention. Additionally, since we only perform this pruning once during the prefilling stage, it effectively reduces KV cache size. Our optimization framework incorporates a visual-aware cost function considering factors such as Feature Similarity, Relative Spatial Distance, and Absolute Central Distance, to measure the importance of each source visual token, enabling effective pruning of low-importance tokens. Extensive experiments demonstrate that our method outperforms previous token pruning methods, validating the effectiveness and efficiency of our approach.
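
As a rough illustration of what a visual-aware cost function over these three factors could look like, the sketch below builds a source-by-target cost matrix for tokens laid out on a patch grid. Everything specific in it is an assumption rather than the paper's definition: the helper names `grid_positions` and `visual_cost`, the use of cosine distance for the feature term, Euclidean grid distance for the two spatial terms, the normalisation to [0, 1], and the weights `alpha` and `beta` that combine the terms by simple addition.

```python
import numpy as np

def grid_positions(height, width):
    """(row, col) coordinates of each patch token in raster order."""
    rows, cols = np.meshgrid(np.arange(height), np.arange(width), indexing="ij")
    return np.stack([rows.ravel(), cols.ravel()], axis=1).astype(np.float64)

def visual_cost(source, target, positions, alpha=1.0, beta=1.0):
    """Cost of matching source token i to target token j (lower is better).

    Assumed form: cosine distance + alpha * relative distance
                  + beta * distance of the source token from the grid centre.
    """
    # Feature term: cosine distance between source and target features.
    s = source / np.linalg.norm(source, axis=1, keepdims=True)
    t = target / np.linalg.norm(target, axis=1, keepdims=True)
    feature = 1.0 - s @ t.T
    # Relative spatial term: grid distance between the two token positions.
    diff = positions[:, None, :] - positions[None, :, :]
    relative = np.linalg.norm(diff, axis=-1)
    relative = relative / relative.max()
    # Absolute central term: distance of each source token from the centre.
    centre = positions.mean(axis=0)
    central = np.linalg.norm(positions - centre, axis=1)
    central = central / central.max()
    return feature + alpha * relative + beta * central[:, None]

# Hypothetical 4 x 4 grid of 8-dimensional visual tokens.
rng = np.random.default_rng(0)
positions = grid_positions(4, 4)
source = rng.normal(size=(16, 8))
target = source + 0.1 * rng.normal(size=(16, 8))  # stand-in for layer outputs
cost = visual_cost(source, target, positions)
print(cost.shape)
```

Setting `alpha` and `beta` to zero reduces the cost to the feature term alone, which is a convenient way to check how much the spatial terms change the matrix.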

Paper Structure

This paper contains 19 sections, 5 equations, 11 figures, 10 tables, 1 algorithm.

Figures (11)

  • Figure 1: The pipeline of the proposed TopV. During the prefilling stage at inference time, we aim to prune the visual tokens at layer $L_i$. First, we select the input visual tokens as source tokens and collect the output tokens after the Post-LN layer of the same transformer layer as target tokens. We then formulate token pruning as an optimization problem that incorporates feature similarity, relative spatial distance, and absolute central distance. Solving this problem with the Sinkhorn algorithm yields the contribution matrix of the source tokens and their importance. Based on these values, we prune unimportant tokens and uniformly recover a subset of the pruned tokens to maintain structural integrity. From layer $L_{i+1}$ onward, the pruned tokens stay removed, leading to faster, low-memory VLM inference. Following Chen et al. [chen2024image], we set $L_i=2$ in our experiments. Notably, our optimization process requires only 2 ms, less than $1\%$ of total inference time.
  • Figure 2: Illustration of target token positions within a transformer layer of VLMs. Positions 1, 2, 3, and 4 correspond to the outputs from the Pre-LN layer, Attention layer, Post-LN layer, and MLP layer, respectively. We empirically select the output after the Post-LN layer (Position 3) as the target tokens.
  • Figure 3: Memory Usage of TopV (a, d), Baseline (b, e), and FastV (c, f) on AI2D and OCRBench tasks for InternVL2-2B. The red circle indicates that a new token is currently being processed.
  • Figure 4: Two visualization heat maps for various hyperparameters.
  • Figure 5: Two visualization examples of important tokens identified by various visual token pruning methods. Left: Important tokens selected by FastV. Right: Important tokens selected by our TopV. The gray patches around the images are introduced during preprocessing, while the red patches indicate the token regions the model focuses on.
  • ...and 6 more figures
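
The prune-then-recover step described in the Figure 1 caption can be sketched as follows. This is one possible reading, not the paper's algorithm: the function name `prune_with_recovery`, the `keep` and `recover` parameters, the importance values, and the interpretation of "uniformly recover" as taking evenly spaced entries from the pruned tokens in their original order are all assumptions.

```python
import numpy as np

def prune_with_recovery(importance, keep, recover):
    """Return sorted indices of the tokens retained after pruning and recovery.

    importance: 1-D array of per-token scores (higher means more important).
    keep:       number of top-scoring tokens to keep.
    recover:    number of pruned tokens to bring back at even spacing.
    """
    order = np.argsort(-importance, kind="stable")  # most important first
    kept = np.sort(order[:keep])
    pruned = np.sort(order[keep:])                  # back in original order
    if recover > 0 and pruned.size > 0:
        # Evenly spaced picks across the pruned tokens (assumed meaning of
        # "uniform" recovery); np.unique guards against duplicate picks.
        picks = np.linspace(0, pruned.size - 1, num=min(recover, pruned.size))
        recovered = pruned[np.unique(picks.round().astype(int))]
    else:
        recovered = np.array([], dtype=int)
    return np.sort(np.concatenate([kept, recovered]))

# Hypothetical importance scores for 10 visual tokens.
importance = np.array([0.9, 0.1, 0.8, 0.2, 0.7, 0.3, 0.6, 0.4, 0.5, 0.05])
retained = prune_with_recovery(importance, keep=4, recover=2)

# Apply the selection once; later layers would then see only these tokens.
tokens = np.arange(10 * 4).reshape(10, 4)  # stand-in for visual token features
pruned_tokens = tokens[retained]
print(retained, pruned_tokens.shape)
```

Returning sorted indices keeps the surviving tokens in their original sequence order, so position-dependent components downstream see them in a consistent layout.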