Stop Looking for Important Tokens in Multimodal Language Models: Duplication Matters More

Zichen Wen, Yifeng Gao, Shaobo Wang, Junyuan Zhang, Qintong Zhang, Weijia Li, Conghui He, Linfeng Zhang

TL;DR

This work challenges the prevailing use of attention-based token importance for pruning vision tokens in multimodal LLMs, showing it can be unreliable and hardware-incompatible. It introduces DART, a duplication-aware, training-free token reduction method that selects a small set of pivot tokens and retains tokens with low duplication to pivots, achieving substantial token reduction with minimal accuracy loss. Theoretical guarantees based on Lipschitz continuity and Hausdorff distance accompany empirical results across multi-modal benchmarks, revealing robust performance with significant speedups and broad applicability. The findings suggest that minimizing token duplication, rather than optimizing importance scores, is a more effective and hardware-friendly path for token efficiency in MLLMs.

Abstract

Vision tokens in multimodal large language models often incur substantial computational overhead because they far outnumber the text tokens. Many recent methods address this problem with token pruning, which first defines an importance criterion for tokens and then prunes the unimportant vision tokens during inference. However, in this paper, we show that importance is not an ideal indicator for deciding whether a token should be pruned. Surprisingly, it usually yields worse performance than random token pruning and is incompatible with efficient attention computation operators. Instead, we propose DART (Duplication-Aware Reduction of Tokens), which prunes tokens based on their duplication with other tokens, leading to significant and training-free acceleration. Concretely, DART selects a small subset of pivot tokens and then retains the tokens with low duplication to the pivots, ensuring minimal information loss during token pruning. Experiments demonstrate that DART can prune 88.9% of vision tokens while maintaining comparable performance, leading to 1.99$\times$ and 2.99$\times$ speed-ups in total time and the prefilling stage, respectively, with good compatibility with efficient attention operators. Our code is available at https://github.com/ZichenWen1/DART.
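
The pivot-then-retain idea described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine` and `prune_by_duplication`, the use of cosine similarity as the duplication measure, the caller-supplied pivot indices, and the max-over-pivots aggregation are all assumptions made for illustration.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # assumption: a zero vector duplicates nothing
    return dot / (norm_u * norm_v)


def prune_by_duplication(tokens, pivot_indices, num_keep):
    """Return sorted indices of retained tokens.

    Keeps every pivot plus the `num_keep` non-pivot tokens whose
    similarity to their closest pivot is lowest (least duplicated).
    """
    pivot_set = set(pivot_indices)
    scored = []
    for j, x in enumerate(tokens):
        if j in pivot_set:
            continue
        # Assumed aggregation: highest similarity to any pivot.
        score = max(cosine(tokens[i], x) for i in pivot_indices)
        scored.append((score, j))
    scored.sort()  # lowest duplication first
    kept = [j for _, j in scored[:num_keep]]
    return sorted(pivot_set | set(kept))


tokens = [
    [1.0, 0.0],   # 0: pivot
    [0.9, 0.1],   # 1: near-duplicate of the pivot
    [0.0, 1.0],   # 2: orthogonal to the pivot
    [-1.0, 0.0],  # 3: opposite to the pivot
]
retained = prune_by_duplication(tokens, pivot_indices=[0], num_keep=2)
print(retained)  # → [0, 2, 3]
```

In this toy example the near-duplicate token 1 is dropped, while the pivot and the two tokens least similar to it are kept. No attention scores are needed for the selection, which is consistent with the abstract's claim of compatibility with efficient attention operators.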

Paper Structure

This paper contains 35 sections, 3 theorems, 19 equations, 9 figures, 11 tables.

Key Result

Lemma 1

$\min_{p_i\in \mathcal{P}} \|p_i-x_j\|\leq (2(1-\epsilon))^{1/2}B,\quad \forall x_j\in \mathcal{X}\setminus \mathcal{R}$.
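
One way to see where a bound of this form can come from is sketched here. It relies on two assumptions that are not stated in this summary: every token has norm exactly $B$, and each pruned token $x_j$ has cosine similarity at least $\epsilon$ with some pivot $p_i$. Under those assumptions:

$$\|p_i - x_j\|^2 = \|p_i\|^2 + \|x_j\|^2 - 2\,p_i^\top x_j = 2B^2 - 2B^2\cos(p_i, x_j) \leq 2(1-\epsilon)B^2.$$

Taking square roots gives $\|p_i - x_j\| \leq (2(1-\epsilon))^{1/2}B$ for that pivot, and hence for the minimum over all pivots.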

Figures (9)

  • Figure 1: Comparison between DART and FastV. Red text indicates hallucination from vanilla LLaVA-1.5-7B, green text represents hallucination from DART, and blue text represents hallucination from FastV.
  • Figure 2: Performance of FastV and SparseVLM compared with random token pruning on LLaVA-1.5-7B, with an 88.9% token reduction ratio.
  • Figure 3: The overview of DART. The process includes (a) selecting pivot tokens, (b) calculating $\epsilon$-Duplicate scores between pivot tokens and other tokens, and (c) reducing tokens to retain those with the least duplication.
  • Figure 4: Performance-Latency trade-off comparisons across different datasets on LLaVA-Next-7B. DART consistently achieves better performance under varying latency constraints compared to other approaches.
  • Figure 5: Impact of the number of pivot tokens.
  • ...and 4 more figures

Theorems & Definitions (8)

  • Definition 1: Pivot Tokens
  • Definition 2: $\epsilon$-duplicate Score
  • Lemma 1: Bounded Distance
  • proof
  • Lemma 2: Bounded Approximation Error
  • proof
  • Theorem 1: Performance Guarantee
  • proof