Token Pruning in Multimodal Large Language Models: Are We Solving the Right Problem?

Zichen Wen, Yifeng Gao, Weijia Li, Conghui He, Linfeng Zhang

TL;DR

This work interrogates token pruning in multimodal large language models, revealing that many sophisticated pruning strategies underperform simple baselines due to position bias and task dependency. It introduces an information-theoretic framework balancing redundancy and importance, and shows that task-specific alpha balancing yields better pruning decisions. The authors argue that real latency, not FLOPs, should drive efficiency claims and highlight training-aware compression as a powerful, often overlooked factor. Collectively, the findings suggest future token pruning should prioritize spatial uniformity, task alignment, and integration with training-time compression to achieve reliable speedups. The work provides concrete guidance for designing practical, hardware-friendly pruning methods with robust evaluation.

Abstract

Multimodal large language models (MLLMs) have shown remarkable performance in cross-modal understanding and generation, yet they still suffer from severe inference costs. Recently, many methods have been proposed to address this problem with token pruning, which identifies redundant tokens in MLLMs and prunes them to reduce computation and KV storage costs, yielding significant acceleration without training. While these methods claim efficiency gains, critical questions about their fundamental design and evaluation remain unanswered: Why do many existing approaches underperform even naive random token selection? Is attention-based scoring sufficient for reliably identifying redundant tokens? Is language information really helpful during token pruning? What makes a good trade-off between token importance and duplication? Are current evaluation protocols comprehensive and unbiased? The neglect of these questions in previous research hinders the long-term development of token pruning. In this paper, we answer these questions one by one, providing insights into the design of future token pruning methods.

Paper Structure

This paper contains 30 sections, 8 equations, 3 figures, 9 tables, 4 algorithms.

Figures (3)

  • Figure 1: Comparison between FastV, SparseVLM, and naive baselines. On several common datasets, the performance of FastV and SparseVLM is even worse than random token dropping and pooling.
  • Figure 2: Analysis of the distribution of tokens and attention scores over the position of tokens. Tokens with larger indexes are located at the bottom of images.
  • Figure 3: Sparse Visualization of Vanilla FastV and Window FastV with 25% Retained Visual Tokens.
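
The naive baselines named in the Figure 1 caption (random token dropping and pooling) can be illustrated with a minimal sketch. This is not the authors' implementation. The function names `random_drop` and `average_pool`, the `keep_ratio` and `window` parameters, and the toy 2-dimensional tokens are all hypothetical. Pooling here runs over consecutive tokens in the 1D sequence, which is an assumption; the paper may pool over the 2D image grid instead.

```python
import random


def random_drop(tokens, keep_ratio, seed=0):
    """Keep a random subset of tokens, preserving their original order.

    Hypothetical helper: `keep_ratio` is the fraction of tokens retained.
    """
    n_keep = max(1, round(len(tokens) * keep_ratio))
    rng = random.Random(seed)
    kept_indices = sorted(rng.sample(range(len(tokens)), n_keep))
    return [tokens[i] for i in kept_indices]


def average_pool(tokens, window):
    """Merge each run of `window` consecutive tokens into their mean vector.

    Assumption: pooling is done along the flattened token sequence.
    """
    pooled = []
    for start in range(0, len(tokens), window):
        group = tokens[start:start + window]
        dim = len(group[0])
        pooled.append(
            [sum(vec[d] for vec in group) / len(group) for d in range(dim)]
        )
    return pooled


# Toy example: 8 "visual tokens", each a 2-dimensional feature vector.
tokens = [[float(i), float(2 * i)] for i in range(8)]

dropped = random_drop(tokens, keep_ratio=0.25)  # 2 of 8 tokens survive
pooled = average_pool(tokens, window=4)         # 8 tokens merged into 2
```

Neither baseline looks at attention scores or at the text prompt, and both reduce the token count by the same factor. That makes them a natural reference point for judging whether a scoring-based method such as FastV or SparseVLM adds anything beyond plain token reduction.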