TrimTokenator: Towards Adaptive Visual Token Pruning for Large Multimodal Models

Hao Zhang, Mengsi Lyu, Chenrui He, Yulong Ao, Yonghua Lin

TL;DR

This paper tackles the computational bottleneck of large multimodal models by pruning visual tokens rather than textual ones. It introduces a two-stage pruning pipeline: first keep the visual tokens best aligned with the text, using a mutual-information-based cross-modal criterion (approximated via an L2 norm), then apply a greedy RepMax procedure to maximize intra-modal diversity among the retained tokens. The approach yields substantial efficiency gains, with up to 88.9% visual token reduction and over 50% faster inference in several settings, while maintaining strong task performance across diverse benchmarks and model sizes. The results show that cross-modal alignment and intra-modal diversity are key to preserving semantic integrity under aggressive token pruning, with implications for scalable deployment of LMMs in resource-constrained environments.

Abstract

Large Multimodal Models (LMMs) have achieved significant success across various tasks. These models usually encode visual inputs into dense token sequences, which are then concatenated with textual tokens and jointly processed by a language model. However, the increased token count substantially raises computational and memory costs during inference. Token pruning has emerged as a promising approach to address this issue. Existing token pruning methods often rely on costly calibration or suboptimal importance metrics, leading to redundant retained tokens. In this paper, we analyze the redundancy differences between visual and textual tokens and propose pruning exclusively on visual tokens. Based on this, we propose a visual token pruning strategy that explicitly preserves both cross-modal alignment and intra-modal informational diversity. We introduce a mutual information-based token pruning strategy that removes visual tokens semantically misaligned with textual tokens, effectively preserving the alignment between the visual and textual modalities. To further improve the representational quality of the retained tokens, we additionally prune redundant visual tokens by maximizing the expected pairwise distances in the embedding space, which is solved efficiently with a greedy algorithm. Extensive experiments demonstrate that our method maintains strong performance while reducing tokens by 88.9% on models such as LLaVA-1.5-7B and LLaVA-NEXT-7B, resulting in a 56.7% improvement in inference speed.
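The first stage described above (dropping visual tokens that are poorly aligned with the text) can be illustrated with a minimal sketch. The abstract does not give the exact scoring formula, so the code assumes one plausible reading of the "L2 norm" approximation: each visual token is scored by the negative mean L2 distance to the textual token embeddings, and the highest-scoring tokens are kept. The function names `alignment_scores` and `keep_top_aligned` are hypothetical and not taken from the paper.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def alignment_scores(visual_tokens, text_tokens):
    """Assumed proxy for cross-modal alignment (not the paper's exact formula):
    negative mean L2 distance from each visual token to all text tokens.
    A higher score means the visual token lies closer to the text."""
    scores = []
    for v in visual_tokens:
        mean_dist = sum(l2_distance(v, t) for t in text_tokens) / len(text_tokens)
        scores.append(-mean_dist)
    return scores


def keep_top_aligned(visual_tokens, text_tokens, keep):
    """Return indices of the `keep` best-aligned visual tokens, in original order."""
    scores = alignment_scores(visual_tokens, text_tokens)
    ranked = sorted(range(len(visual_tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Toy example: four 2-D "visual tokens" and two "text tokens".
visual = [[0.0, 0.0], [1.0, 0.0], [5.0, 5.0], [0.2, 0.1]]
text = [[0.0, 0.0], [0.5, 0.0]]
kept = keep_top_aligned(visual, text, keep=2)
print(kept)
```

In this toy input, the tokens at indices 0 and 3 sit closest to the text embeddings, so they survive. As a point of scale, LLaVA-1.5 passes 576 visual tokens per image to the language model, so an 88.9% reduction corresponds to keeping 64 of them; how that budget is split between the two stages is not stated in the abstract.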

Paper Structure

This paper contains 25 sections, 19 equations, 2 figures, 11 tables.

Figures (2)

  • Figure 1: Overview of the visual token pruning method. We compute the average mutual information (MI) between each visual token and the textual tokens to obtain its semantic alignment score $\alpha_i$, and preserve the highest-scoring tokens to form $\mathcal{X}^{v(1)}$. This subset is further refined by Greedy RepMax, which prunes redundant tokens to yield $\mathcal{X}^{v(2)}$. Greedy RepMax is a greedy approximation to the NP-hard problem of maximizing the expected pairwise distance among visual tokens.
  • Figure 2: Case study comparing captions generated by dense and pruned models (LLaVA-1.5-7B) on the COCO dataset, demonstrating output consistency despite substantial token reduction.
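The diversity step in the Figure 1 caption can be sketched as a generic greedy selection. The caption says only that Greedy RepMax greedily approximates maximizing the expected pairwise distance, so the code below is an assumption rather than the authors' algorithm: it seeds with the token farthest from all others in total, then repeatedly adds the token with the largest summed L2 distance to the tokens already chosen. The name `greedy_diverse_subset` is hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def greedy_diverse_subset(tokens, keep):
    """Greedily pick `keep` token indices with large total pairwise distance.

    This is a generic greedy heuristic standing in for Greedy RepMax; the
    seeding rule and the additive selection order are assumptions.
    """
    n = len(tokens)
    if keep <= 0:
        return []
    if keep >= n:
        return list(range(n))

    # Precompute the full pairwise distance matrix (O(n^2) memory).
    dist = [[l2_distance(tokens[i], tokens[j]) for j in range(n)] for i in range(n)]

    # Seed with the token that is farthest from all others in total.
    first = max(range(n), key=lambda i: sum(dist[i]))
    selected = [first]
    # gain[i] holds the summed distance from token i to the selected set.
    gain = dist[first][:]

    while len(selected) < keep:
        chosen = set(selected)
        best = max((i for i in range(n) if i not in chosen), key=lambda i: gain[i])
        selected.append(best)
        for i in range(n):
            gain[i] += dist[best][i]

    return sorted(selected)


# Toy example: tokens 0 and 1 are near-duplicates, tokens 2 and 3 are distinct.
tokens = [[0.0, 0.0], [0.1, 0.0], [10.0, 0.0], [0.0, 10.0]]
subset = greedy_diverse_subset(tokens, keep=3)
print(subset)
```

Here the near-duplicate at index 1 is the token that gets pruned, which matches the intent of removing redundant tokens. A sum-of-distances objective tends to favor tokens at the extremes of the embedding cloud, so a real implementation might normalize embeddings or use a different seeding rule.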