Fit and Prune: Fast and Training-free Visual Token Pruning for Multi-modal Large Language Models

Weihao Ye; Qiong Wu; Wenhao Lin; Yiyi Zhou

TL;DR

FitPrune tackles the high compute cost of multimodal LLMs by pruning visual tokens without any training. It casts pruning as a distribution-fitting problem over self- and cross-attention distributions and derives a pruning recipe with a binary-search-based greedy algorithm that uses a small batch of inference data, avoiding retraining. The method achieves substantial FLOPs reductions (e.g., -54.9% FLOPs for LLaVA-NEXT with only a 0.5% accuracy drop) across LLaVA variants and multiple vision-language benchmarks, and the pruning recipe can be computed in about 5 minutes. The code is open source.

Abstract

Recent Multimodal Large Language Models (MLLMs) often use a large number of image tokens to compensate for their visual shortcomings, which not only introduces obvious redundancy but also greatly exacerbates the already high computational cost. Token pruning is an effective way to speed up MLLMs, but when and how to drop tokens remains a challenge. In this paper, we propose a novel and training-free approach for effective visual token pruning of MLLMs, termed FitPrune, which can quickly produce a complete pruning recipe for an MLLM according to a pre-defined budget. Specifically, FitPrune treats token pruning as a statistical problem of the MLLM, and its objective is to find an optimal pruning scheme that minimizes the divergence of the attention distributions before and after pruning. In practice, FitPrune can be carried out quickly based on the attention statistics from a small batch of inference data, avoiding expensive trials on MLLMs. Following the pruning recipe, an MLLM can directly remove the redundant visual tokens of different examples during inference. To validate FitPrune, we apply it to a set of recent MLLMs, including LLaVA-1.5, LLaVA-HR and LLaVA-NEXT, and conduct extensive experiments on a set of benchmarks. The experimental results show that FitPrune can greatly reduce computational complexity while retaining high performance, e.g., -54.9% FLOPs for LLaVA-NEXT with only a 0.5% accuracy drop. Notably, the pruning recipe can be obtained in about 5 minutes. Our code is available at https://github.com/ywh187/FitPrune.
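The budgeted recipe search described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation. It assumes that the attention statistics are two arrays of shape (layers, visual tokens) holding the average self- and cross-attention each visual token receives, that the divergence caused by pruning is approximated by the fraction of attention mass removed, and that the compute budget is approximated by the fraction of (layer, token) pairs kept. All function names (`tokens_removable`, `recipe_for_alpha`, `fit_recipe`) are hypothetical.

```python
import numpy as np

def tokens_removable(attn, alpha):
    """Largest k such that dropping the k least-attended tokens removes
    at most `alpha` of the total attention mass (assumed divergence proxy)."""
    cum = np.cumsum(np.sort(attn) / attn.sum())
    return int(np.searchsorted(cum, alpha, side="right"))

def recipe_for_alpha(self_attn, cross_attn, alpha):
    """Number of visual tokens kept per layer for a divergence threshold."""
    n_layers, n_tokens = self_attn.shape
    kept = np.empty(n_layers, dtype=int)
    for layer in range(n_layers):
        # Both distributions must stay within the threshold.
        k = min(tokens_removable(self_attn[layer], alpha),
                tokens_removable(cross_attn[layer], alpha))
        kept[layer] = n_tokens - k
    # A token dropped at one layer stays dropped in all later layers.
    return np.minimum.accumulate(kept)

def cost(kept, n_tokens):
    """Assumed compute proxy: fraction of (layer, token) pairs kept."""
    return kept.sum() / (len(kept) * n_tokens)

def fit_recipe(self_attn, cross_attn, budget, iters=30):
    """Binary-search the smallest threshold whose recipe meets the budget."""
    n_tokens = self_attn.shape[1]
    lo, hi = 0.0, 1.0
    best = recipe_for_alpha(self_attn, cross_attn, hi)
    for _ in range(iters):
        mid = (lo + hi) / 2
        kept = recipe_for_alpha(self_attn, cross_attn, mid)
        if cost(kept, n_tokens) <= budget:
            best, hi = kept, mid   # budget met: try a smaller divergence
        else:
            lo = mid               # too expensive: allow more divergence
    return best

# Synthetic, skewed attention statistics standing in for real measurements.
rng = np.random.default_rng(0)
self_attn = rng.random((8, 64)) ** 4
cross_attn = rng.random((8, 64)) ** 4
recipe = fit_recipe(self_attn, cross_attn, budget=0.5)
print(recipe)
```

The search relies on monotonicity: a larger threshold never keeps more tokens, so the cost proxy is non-increasing in the threshold. A faithful budget would count FLOPs, which also include terms quadratic in sequence length; the linear proxy here only keeps the sketch short.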

Paper Structure (19 sections, 12 equations, 11 figures, 7 tables, 1 algorithm)

Introduction
Related Work
Vision-Language Models
Token Pruning
Method
Experiments
Datasets and Metrics
Implementation Details
Experimental Results
Conclusion
Acknowledgments
Appendix
Impact of pruning ratio on divergence and performance
Detailed ablation results of pruning ratio
Analysis of attention map changes
...and 4 more sections

Figures (11)

Figure 1: Visualization of the cross- and self-attention of image tokens in LLaVA-1.5 7B (Liu et al., 2023). These tokens become less active in the higher layers of an MLLM.
Figure 2: The impact of token pruning according to the fitting of the cross- and self-attention distributions of visual tokens in LLaVA-1.5. GQA accuracy (Hudson and Manning, 2019) is reported. Pruning recipes that fit the attention distribution better retain better performance. However, considering only a single distribution makes it hard to obtain the optimal pruning recipe. FitPrune therefore considers the fitting of both cross- and self-attention.
Figure 3: Illustration of FitPrune. (a) FitPrune reduces the number of visual tokens in the MHA of each layer. (b) The pruning recipe is obtained via binary search based on the attention statistics of a set of examples. Its principle is to find the pruning recipe that minimizes the gap between the distributions before and after pruning. (c) During inference, the MLLM drops tokens according to the FitPrune pruning recipe.
Figure 4: Performance of LLaVA-1.5 using different ratios of random pruning on TextVQA.
Figure 5: Performance comparison of FitPrune and other pruning methods on LLaVA-1.5 7B w.r.t. different pruning ratios.
...and 6 more figures
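
The inference step in Figure 3(c) can be sketched as follows. This is an illustrative Python example, not the released code: it assumes that each visual token already has a per-example importance score derived from its attention, and that the recipe supplies the number of tokens to keep at the current layer. The function name `prune_visual_tokens` and its arguments are hypothetical.

```python
import numpy as np

def prune_visual_tokens(hidden, scores, keep):
    """Keep the `keep` highest-scoring visual tokens of one example.

    hidden: (n_tokens, dim) hidden states of the visual tokens
    scores: (n_tokens,) assumed per-example importance scores
    keep:   number of tokens the recipe allows at this layer
    """
    keep = max(0, min(keep, len(scores)))
    # Indices of the top-`keep` scores, restored to their original order
    # so that the relative positions of surviving tokens are preserved.
    top = np.argsort(-scores, kind="stable")[:keep]
    idx = np.sort(top)
    return hidden[idx], idx

hidden = np.arange(12, dtype=float).reshape(6, 2)
scores = np.array([0.1, 0.5, 0.05, 0.3, 0.02, 0.4])
kept_hidden, kept_idx = prune_visual_tokens(hidden, scores, keep=3)
print(kept_idx)
```

Because the recipe fixes only how many tokens survive at each layer, the specific tokens that are dropped can differ from one example to the next.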
