DivPrune: Diversity-based Visual Token Pruning for Large Multimodal Models

Saeed Ranjbar Alvar, Gursimran Singh, Mohammad Akbari, Yong Zhang

TL;DR

DivPrune tackles the inference cost of large multimodal models (LMMs) by pruning visual tokens without fine-tuning. It reframes pruning as a Max-Min Diversity Problem (MMDP): choose a subset of $\tilde{M}$ visual tokens so that the diversity among the retained tokens is maximized, where the distance between two tokens is the cosine distance $d(\gamma,\omega)=1-\frac{\gamma\cdot \omega}{\|\gamma\|\|\omega\|}$. The method is calibration-free and plug-and-play, applicable to various LMMs and layers. Experiments across 16 image- and video-language datasets report state-of-the-art accuracy among the compared pruning methods, along with lower end-to-end latency and GPU memory usage.
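For reference, the selection objective described above can be written in the standard MMDP form. This is a sketch under assumed notation: $\Gamma$ (the full set of visual tokens) and $S$ (a candidate subset) are symbols introduced here for illustration, while $d$, $\gamma$, $\omega$, and $\tilde{M}$ follow the text above.

$$
\tilde{\Gamma} \;=\; \underset{S \subseteq \Gamma,\; |S| = \tilde{M}}{\arg\max}\;\; \min_{\substack{\gamma,\,\omega \in S \\ \gamma \neq \omega}} d(\gamma,\omega),
\qquad
d(\gamma,\omega)=1-\frac{\gamma\cdot \omega}{\|\gamma\|\,\|\omega\|}
$$

In words: among all subsets of size $\tilde{M}$, keep the one whose two closest tokens are as far apart as possible.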

Abstract

Large Multimodal Models (LMMs) have emerged as powerful models capable of understanding various data modalities, including text, images, and videos. LMMs encode both text and visual data into tokens that are then combined and processed by an integrated Large Language Model (LLM). Including visual tokens substantially increases the total token count, often by thousands. The increased input length for the LLM significantly raises the complexity of inference, resulting in high latency in LMMs. To address this issue, token pruning methods, which remove part of the visual tokens, have been proposed. The existing token pruning methods either require extensive calibration and fine-tuning or rely on suboptimal importance metrics, which results in increased redundancy among the retained tokens. In this paper, we first formulate token pruning as a Max-Min Diversity Problem (MMDP), where the goal is to select a subset such that the diversity among the selected tokens is maximized. Then, we solve the MMDP to obtain the selected subset and prune the rest. The proposed method, DivPrune, reduces redundancy and achieves the highest diversity of the selected tokens. By ensuring high diversity, the selected tokens better represent the original tokens, enabling effective performance even at high pruning ratios without requiring fine-tuning. Extensive experiments with various LMMs show that DivPrune achieves state-of-the-art accuracy over 16 image- and video-language datasets. Additionally, DivPrune reduces both the end-to-end latency and GPU memory usage for the tested models. The code is available at https://github.com/vbdi/divprune.
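To make the "formulate as MMDP, solve, prune the rest" idea concrete, here is a minimal NumPy sketch of one common way to approximate a max-min diverse subset: a greedy farthest-point heuristic over cosine distances. It is an illustration under stated assumptions, not the authors' released implementation. The function names (`cosine_distance_matrix`, `greedy_max_min_select`, `min_pairwise_distance`) and the choice of seeding with token 0 are hypothetical; the paper's own algorithm may initialize and iterate differently.

```python
import numpy as np


def cosine_distance_matrix(tokens: np.ndarray) -> np.ndarray:
    """Pairwise cosine distance d(g, w) = 1 - g.w / (|g||w|) between rows."""
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.clip(norms, 1e-12, None)  # guard against zero vectors
    return 1.0 - unit @ unit.T


def greedy_max_min_select(tokens: np.ndarray, keep: int) -> list[int]:
    """Greedy heuristic for a max-min diverse subset of `keep` tokens.

    Assumption: the seed is token 0 (an arbitrary illustrative choice).
    Each step adds the token whose distance to its nearest already
    selected token is largest.
    """
    dist = cosine_distance_matrix(tokens)
    n = dist.shape[0]
    if not 1 <= keep <= n:
        raise ValueError("keep must be between 1 and the number of tokens")

    selected = [0]
    # min_dist[i] = distance from token i to its nearest selected token
    min_dist = dist[0].copy()
    while len(selected) < keep:
        min_dist[selected] = -np.inf  # never re-pick a selected token
        nxt = int(np.argmax(min_dist))
        selected.append(nxt)
        min_dist = np.minimum(min_dist, dist[nxt])
    return sorted(selected)  # sorted so the original token order is preserved


def min_pairwise_distance(tokens: np.ndarray, idx: list[int]) -> float:
    """Smallest cosine distance between any two distinct tokens in `idx`."""
    sub = cosine_distance_matrix(tokens[idx])
    off_diagonal = ~np.eye(len(idx), dtype=bool)
    return float(sub[off_diagonal].min())


if __name__ == "__main__":
    # Toy "visual tokens": rows 0 and 1 are near-duplicates.
    toy = np.array([[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [-1.0, 0.0]])
    kept = greedy_max_min_select(toy, keep=3)
    print(kept, min_pairwise_distance(toy, kept))
```

In the toy example the near-duplicate of token 0 is the one dropped, which is the behavior the abstract attributes to diversity-based selection: redundant tokens are pruned first. Returning sorted indices is a design choice made here so the kept tokens can be gathered without disturbing their original order.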

Paper Structure

This paper contains 23 sections, 5 equations, 7 figures, 6 tables, 1 algorithm.

Figures (7)

  • Figure 1: Comparison of different visual token pruning methods across various pruning ratios for LLaVA 1.5-7B. The y-axis is the performance averaged on COCO (CIDEr), OKVQA (Acc), POPE (F1), and MMBench (Acc). The x-axis is the TFLOP ratio of the model after token pruning compared to the original model before pruning. The proposed method significantly outperforms all baselines. Note that, unlike other methods, FitPrune uses an additional calibration step to prune tokens.
  • Figure 2: An overview of the LMM architecture, with DivPrune applied to visual tokens. The blocks on the right-hand side illustrate the steps of the method.
  • Figure 3: (a) t-SNE visualization of visual tokens for the original model, our method, and FastV. (b) Histogram of the Max-Min distance between the selected tokens over the SeedBench dataset.
  • Figure 4: Comparison of different visual token pruning methods across various pruning ratios for LLaVA 1.5-13B. The y-axis is the performance averaged on COCO (CIDEr), OKVQA (Acc), POPE (F1), and MMBench (Acc). The x-axis is the TFLOP ratio of the model after token pruning compared to the original model before pruning.
  • Figure 5: (a)-(b) t-SNE visualization of visual tokens using SeedBench samples, (c)-(e) t-SNE visualization of visual tokens using GQA samples, (f) Histogram of the Max-Min distance between the selected tokens over the GQA dataset.
  • ...and 2 more figures