Multi-Stage Vision Token Dropping: Towards Efficient Multimodal Large Language Model

Ting Liu, Liangtao Shi, Richang Hong, Yue Hu, Quanjun Yin, Linfeng Zhang

TL;DR

Vision-token redundancy in multimodal large language models imposes high computational costs. MustDrop introduces a multi-stage, training-free framework that progressively prunes vision tokens during encoding, prefilling, and decoding via Local Spatial Merging, dual-attention filtering, and an output-aware KV cache policy. The approach achieves substantial efficiency gains (e.g., ~88.9% token reduction, ~88.5% FLOPs saved) with minimal accuracy loss, outperforming state-of-the-art single-stage methods on multiple image and video benchmarks. This work provides a practical, plug-and-play solution to scale MLLMs in real-world scenarios without additional training.

Abstract

The vision tokens in multimodal large language models usually exhibit significant spatial and temporal redundancy and account for most of the input tokens, which harms inference efficiency. To address this problem, recent works drop unimportant tokens during inference, but they decide the importance of each token using information from only the vision encoding stage or only the prefilling stage. In this paper, we propose Multi-stage Token Dropping (MustDrop), which measures the importance of each token across its whole lifecycle, including the vision encoding stage, prefilling stage, and decoding stage. Concretely, in the vision encoding stage, MustDrop merges spatially adjacent tokens with high similarity and establishes a key token set to retain the most vision-critical tokens, preventing them from being discarded in later stages. In the prefilling stage, MustDrop further compresses vision tokens under the guidance of text semantics, using a dual-attention filtering strategy. In the decoding stage, an output-aware cache policy is proposed to further reduce the size of the KV cache. By leveraging tailored strategies at each stage, MustDrop can more precisely identify important and redundant tokens, thus achieving an optimal balance between performance and efficiency. For instance, MustDrop reduces FLOPs by about 88.5% on LLaVA with a compression ratio of 92.2% while maintaining comparable accuracy. Our code is available at https://github.com/liuting20/MustDrop.
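
To make the prefilling-stage idea concrete, the following is a minimal sketch of text-guided pruning that never drops tokens in the protected key set. It is not the authors' implementation. It assumes that the two signals in "dual-attention filtering" are an aggregate score over all text tokens and a per-text-token peak score; that reading, the function name `prune_vision_tokens`, and both thresholds are hypothetical and not taken from the abstract.

```python
from typing import List, Set


def prune_vision_tokens(
    attn: List[List[float]],   # attn[t][v]: attention from text token t to vision token v
    key_set: Set[int],         # vision-token indices protected in the encoding stage
    global_thresh: float,      # hypothetical threshold on the mean score over text tokens
    individual_thresh: float,  # hypothetical threshold on the peak single-text-token score
) -> List[int]:
    """Return the indices of the vision tokens that survive pruning."""
    num_vision = len(attn[0])
    kept = []
    for v in range(num_vision):
        column = [row[v] for row in attn]
        mean_score = sum(column) / len(column)  # importance to the prompt as a whole
        max_score = max(column)                 # importance to any individual text token
        # Keep a token if it is protected, or if either attention signal is strong enough.
        if v in key_set or mean_score >= global_thresh or max_score >= individual_thresh:
            kept.append(v)
    return kept


if __name__ == "__main__":
    # Two text tokens attending to four vision tokens (toy values).
    attn = [
        [0.5, 0.05, 0.0, 0.02],
        [0.3, 0.05, 0.6, 0.02],
    ]
    print(prune_vision_tokens(attn, key_set={3}, global_thresh=0.35, individual_thresh=0.5))
```

In this toy input, token 3 has low attention from every text token and survives only because it is in the key set, which illustrates why a set established during encoding can matter in later stages.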

Paper Structure

This paper contains 17 sections, 10 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Comparison of vision token dropping methods: (a) methods that only drop tokens during the vision encoding stage, e.g., PruMerge and ToMe, (b) methods that remove tokens only in the prefilling stage, e.g., FastV and SparseVLM, and (c) our MustDrop approach, which gradually removes invalid tokens during the vision encoding, prefilling, and decoding stages.
  • Figure 2: The architecture of MustDrop. In the vision encoding stage, MustDrop merges similar neighboring tokens by window scanning, and establishes a key set of vision-critical tokens that will not be removed at any stage. Then, the dual-attention filtering mechanism decides whether to prune tokens during prefilling. Finally, the output-aware KV cache policy further removes vision tokens during decoding.
  • Figure 3: The importance of individual text tokens. (a) Visualization of token dropping by our MustDrop. (b) Visualization of the distribution of dual-attention scores for vision tokens.
  • Figure 4: Visualization of attention during the decoding process for LLaVA1.5-7B at certain layers. The attention maps of all layers can be seen in the Appendix.
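
The window-scanning merge mentioned in the Figure 2 caption can be illustrated with a small sketch. It is only one plausible reading, not the paper's procedure: it assumes non-overlapping square windows, cosine similarity measured against the first token in each window, and averaging as the merge operation. The names `merge_windows`, `window`, and `threshold` are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def merge_windows(grid: List[List[Vector]], window: int, threshold: float) -> List[Vector]:
    """Scan a 2D grid of token vectors and merge each window whose tokens are all similar."""
    height, width = len(grid), len(grid[0])
    merged: List[Vector] = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            tokens = [
                grid[r][c]
                for r in range(top, min(top + window, height))
                for c in range(left, min(left + window, width))
            ]
            anchor = tokens[0]
            if all(cosine(anchor, t) >= threshold for t in tokens[1:]):
                # Every token resembles the anchor: replace the window with its mean.
                dim = len(anchor)
                merged.append([sum(t[i] for t in tokens) / len(tokens) for i in range(dim)])
            else:
                # The window contains distinct content: keep every token.
                merged.extend(tokens)
    return merged


if __name__ == "__main__":
    a, b = [1.0, 0.0], [0.0, 1.0]
    # Left 2x2 window is uniform; right 2x2 window mixes two different tokens.
    grid = [[a, a, a, b], [a, a, a, b]]
    print(len(merge_windows(grid, window=2, threshold=0.9)))
```

A uniform window collapses to one token while a mixed window is left untouched, so the token count shrinks only where neighboring patches are redundant.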