freePruner: A Training-free Approach for Large Multimodal Model Acceleration

Bingxin Xu, Yuzhang Shang, Yunhao Ge, Qian Lou, Yan Yan

TL;DR

freePruner is a training-free token reduction approach that can be applied directly to any open-source LMM without additional training. It is orthogonal to other post-training acceleration techniques, such as post-training quantization, and can be combined with them, providing a practical solution for efficient LMM deployment.

Abstract

Large Multimodal Models (LMMs) have demonstrated impressive capabilities in visual-language tasks but face significant deployment challenges due to their high computational demands. While recent token reduction methods show promise for accelerating LMMs, they typically require extensive retraining or fine-tuning, making them impractical for many state-of-the-art models, especially those with proprietary training data. We propose freePruner, a training-free token reduction approach that can be directly applied to any open-source LMM without additional training. Unlike existing methods that rely heavily on token merging operations, freePruner employs a two-stage token selection strategy: (1) identifying pivotal tokens that capture high-level semantic information using our designed contribution degree metric, and (2) selecting complementary tokens that preserve essential low-level visual details through attention pattern analysis. Extensive experiments demonstrate that freePruner achieves 2x acceleration while maintaining comparable performance across mainstream visual question-answering benchmarks in the training-free setting. Moreover, freePruner is orthogonal to and can be combined with other post-training acceleration techniques, such as post-training quantization, providing a practical solution for efficient LMM deployment.

Paper Structure

This paper contains 16 sections, 3 equations, 8 figures, 5 tables, 1 algorithm.

Figures (8)

  • Figure 1: A training-free approach for LMM acceleration. freePruner enables automatic token selection using only pretrained LMMs, requiring no additional training or fine-tuning. Our method achieves acceleration while maintaining model performance, providing "free-lunch" speedup for LMM inference.
  • Figure 2: Performance comparison across VQA tasks. freePruner achieves 2× acceleration of LLaVA while maintaining comparable performance across mainstream VQA benchmarks. Notably, our token selection approach is orthogonal to other training-free LLM acceleration methods (e.g., quantization), enabling potential combinations for even greater efficiency gains.
  • Figure 3: Overview of freePruner. freePruner has two modules: (1) Identify pivotal tokens for high-level visual information via contribution degree across layers (see Sec. \ref{subsec:pt}). The values of our token contribution degree metric are sparsely distributed in the encoder's middle layers, and we leverage this property to select the tokens representing high-level features. (2) Select complementary tokens for low-level visual information in the penultimate layer (see Sec. \ref{subsec:ct}). This module uses the pivotal tokens as anchors to retrieve complementary tokens containing low-level feature information. Together, the two modules realize training-free token selection in a coarse-to-fine manner.
  • Figure 4: Distribution of token contribution degree $\mathbf{r}_l$ across different transformer layers. The consistently sparse distribution patterns demonstrate that only a small subset of tokens serves as global information aggregators at each layer, regardless of network depth. This sparsity property enables effective identification of pivotal tokens for high-level feature representation.
  • Figure 5: Pivotal tokens are found in areas of the image with dense information. They capture the high-level semantic information.
  • ...and 3 more figures
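
The two-stage selection described in the Figure 3 caption can be illustrated with a small sketch. This is not the authors' implementation: the contribution degree metric is not defined in this section, so the sketch assumes, purely for illustration, that a token's contribution degree is the average attention it receives from all tokens in one layer, and that pivotal tokens are simply the top-k tokens by that score. The function names, the top-k rule, and the toy attention matrices are all hypothetical.

```python
# Illustrative sketch only. ASSUMPTIONS (not from the paper):
#   - contribution degree of token j = mean attention received by j
#   - pivotal tokens = top-k tokens by contribution degree
#   - complementary tokens = non-pivotal tokens most attended by the
#     pivotal tokens in the penultimate layer

def contribution_degree(attn):
    """attn[i][j] is the attention weight from query token i to key token j."""
    n = len(attn)
    return [sum(attn[i][j] for i in range(n)) / n for j in range(n)]

def select_pivotal(attn, k):
    """Stage 1: pick the k tokens with the highest (assumed) contribution degree."""
    r = contribution_degree(attn)
    ranked = sorted(range(len(r)), key=lambda j: r[j], reverse=True)
    return sorted(ranked[:k])

def select_complementary(attn_penultimate, pivotal, m):
    """Stage 2: use pivotal tokens as anchors and pick the m non-pivotal
    tokens that receive the most attention from them."""
    pivotal_set = set(pivotal)
    scores = {
        j: sum(attn_penultimate[p][j] for p in pivotal)
        for j in range(len(attn_penultimate))
        if j not in pivotal_set
    }
    ranked = sorted(scores, key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:m])

def select_tokens(attn_mid, attn_penultimate, k, m):
    """Coarse-to-fine selection: pivotal tokens first, then complementary ones."""
    pivotal = select_pivotal(attn_mid, k)
    complementary = select_complementary(attn_penultimate, pivotal, m)
    return sorted(pivotal + complementary)

# Toy 4-token attention maps (each row sums to 1); values are made up.
attn_mid = [
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.7, 0.1, 0.1],
    [0.1, 0.6, 0.2, 0.1],
    [0.2, 0.6, 0.1, 0.1],
]
attn_pen = [
    [0.4, 0.2, 0.2, 0.2],
    [0.1, 0.2, 0.1, 0.6],
    [0.3, 0.3, 0.2, 0.2],
    [0.2, 0.2, 0.3, 0.3],
]

kept = select_tokens(attn_mid, attn_pen, k=1, m=1)
print(kept)
```

In this toy example, token 1 receives most of the attention in the middle layer, so it becomes the pivotal token; token 3 is the token it attends to most in the penultimate layer, so it is kept as the complementary token. Keeping 2 of 4 tokens halves the visual sequence length passed on to the language model.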