Variation-aware Vision Token Dropping for Faster Large Vision-Language Models

Junjie Chen, Xuyang Liu, Zichen Wen, Yiyu Wang, Siteng Huang, Honggang Chen

TL;DR

This work addresses the inference cost that LVLMs incur from large visual token counts by introducing V$^2$Drop, a variation-aware token dropping method. It measures how much each visual token changes across LLM layers and prunes the "lazy" tokens that change least. By dropping low-variation tokens progressively at selected layers, V$^2$Drop avoids positional bias and remains compatible with efficient operators such as Flash Attention, achieving near-original accuracy while delivering substantial speedups, especially on video tasks. Experiments across image and video benchmarks and multiple LVLMs show robust performance-efficiency trade-offs and indicate that progressive dropping outperforms one-shot strategies. The approach is a practical, training-free acceleration mechanism for LVLMs and may extend to other vision-language tasks.

Abstract

Large vision-language models (LVLMs) have demonstrated remarkable capabilities in multimodal understanding tasks. However, the increasing demand for high-resolution image and long-video understanding results in substantial token counts, leading to reduced inference efficiency. Token compression offers a direct solution by reducing the number of tokens to be processed, thereby improving computational efficiency. Through extensive analysis, we identify two critical limitations in existing inner-LLM token compression methods: positional bias and incompatibility with efficient operators, which hinder their practical deployment for LVLM acceleration. This paper presents the first approach from a token variation perspective, revealing that visual token variations within LLMs exhibit task-agnostic properties. We propose Variation-aware Vision Token Dropping (i.e., V$^2$Drop), which progressively removes visual tokens with minimal variation during LVLM inference, thereby enhancing computational efficiency. Extensive experiments across multiple models and benchmarks demonstrate that our V$^2$Drop is able to maintain 94.0% and 98.6% of the original model performance for image and video understanding tasks respectively, while reducing LLM generation latency by 31.5% and 74.2%. When combined with efficient operators, V$^2$Drop further reduces GPU peak memory usage.
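
The progressive dropping idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. The L2 norm as the variation metric, the function names, and the `schedule` mapping (layer index to number of vision tokens kept) are all assumptions made for illustration. The sketch also works on precomputed hidden states, whereas a real model would skip computation for dropped tokens in later layers.

```python
import numpy as np

def variation_scores(prev_hidden, curr_hidden):
    """Per-token change between two adjacent layers (L2 norm; an assumed metric)."""
    return np.linalg.norm(curr_hidden - prev_hidden, axis=-1)

def drop_lazy_tokens(prev_hidden, curr_hidden, vision_idx, keep):
    """Return sorted positions to retain: all non-vision tokens plus the
    `keep` vision tokens whose hidden states changed the most."""
    vision_idx = np.asarray(vision_idx)
    scores = variation_scores(prev_hidden[vision_idx], curr_hidden[vision_idx])
    top = np.argsort(-scores, kind="stable")[:keep]  # highest variation first
    other_idx = np.setdiff1d(np.arange(curr_hidden.shape[0]), vision_idx)
    return np.sort(np.concatenate([other_idx, vision_idx[top]]))

def progressive_drop(hidden_by_layer, vision_idx, schedule):
    """Apply dropping at each layer in `schedule` (hypothetical: layer -> keep count).
    Layer indices must be >= 1 so that a previous layer exists.
    Returns the original positions of the tokens that survive all stages."""
    kept = np.arange(hidden_by_layer[0].shape[0])
    for layer, keep in sorted(schedule.items()):
        prev = hidden_by_layer[layer - 1][kept]
        curr = hidden_by_layer[layer][kept]
        # Positions of surviving vision tokens within the current kept set.
        local_vis = np.flatnonzero(np.isin(kept, vision_idx))
        kept = kept[drop_lazy_tokens(prev, curr, local_vis, keep)]
    return kept

# Toy example: 6 tokens, positions 1-4 are vision tokens, 3 "layers" of states.
h0 = np.zeros((6, 2))
h1 = h0.copy()
h1[:, 0] = [1.0, 0.1, 3.0, 0.2, 2.0, 1.0]
h2 = h1.copy()
h2[:, 1] = [0.0, 9.0, 0.5, 4.0, 1.0, 0.0]
survivors = progressive_drop([h0, h1, h2], [1, 2, 3, 4], {1: 3, 2: 1})
print(survivors)
```

Because the score depends only on hidden states and not on attention maps, a selection step of this kind does not require materializing attention weights, which is consistent with the compatibility with efficient operators that the abstract claims.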

Paper Structure

This paper contains 21 sections, 5 equations, 9 figures, 8 tables, 1 algorithm.

Figures (9)

  • Figure 1: Performance-Efficiency trade-offs comparison. V$^2$Drop achieves superior performance-efficiency trade-offs across both image and video understanding tasks.
  • Figure 2: Attention-guided token evaluation vs. variation-aware token evaluation. Attention-guided methods (e.g., FastV, PDrop, SparseVLM) exhibit information-agnostic positional bias, assigning high importance to later positions regardless of content (red arrows and boxes), and are incompatible with efficient operators. In contrast, measuring token-wise variation intuitively reflects token importance (green boxes) while maintaining compatibility with efficient operators.
  • Figure 3: Quantifying vision token variation with different metrics. Regions corresponding to the answer exhibit significant variation magnitudes (red boxes).
  • Figure 4: Overall framework of V$^2$Drop. V$^2$Drop measures token-wise variation across adjacent LLM layers and progressively drops vision tokens with minimal variation (i.e., lazy tokens), thereby achieving plug-and-play inference acceleration.
  • Figure 5: Effects of different variation measurement metrics. Comparison of three variation measurement methods with FastV when retaining 128 tokens on LLaVA-1.5-7B across different datasets. The red line represents the average performance gap between the three strategies and FastV, while the green line shows throughput.
  • ...and 4 more figures