IDPruner: Harmonizing Importance and Diversity in Visual Token Pruning for MLLMs

Yifan Tan, Yifu Sun, Shirui Huang, Hong Liu, Guanghua Yu, Jianchen Zhu, Yangdong Deng

TL;DR

IDPruner addresses the high computational cost of visual tokens in multimodal LLMs by harmonizing token importance and semantic diversity via Maximal Marginal Relevance (MMR). It casts pruning as an information-retrieval re-ranking problem and greedily selects the next token with the balanced objective $v^* = \arg\max_{v_i\in\mathcal{V}\setminus\mathcal{S}} [ \lambda \cdot \mathrm{Imp}(v_i) - (1-\lambda) \cdot \max_{v_j\in\mathcal{S}} \mathrm{Sim}(v_i,v_j) ]$, where $\mathcal{V}$ is the set of visual tokens, $\mathcal{S}$ is the set of tokens already selected, and $\lambda$ trades importance against redundancy. Importance is min-max normalized, $\mathrm{Imp}(v_i) = \frac{w_i-\min(\mathbf{w})}{\max(\mathbf{w})-\min(\mathbf{w})+\epsilon}$, and similarity is the cosine similarity, $\mathrm{Sim}(v_i,v_j) = \frac{v_i^\top v_j}{\|v_i\|\|v_j\|}$. This one-shot pruning needs no attention maps, which keeps it compatible with FlashAttention, and it yields state-of-the-art results across multiple architectures and tasks: on Qwen2.5-VL-7B-Instruct it retains $95.18\%$ of baseline performance on average while keeping only 25% of the visual tokens, and video-language performance remains robust (e.g., $87.13\%$ at 75% pruning). IDPruner demonstrates strong cross-architecture generalization and practical deployment advantages, highlighting the importance of jointly optimizing token importance and semantic diversity.
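
The greedy selection implied by the MMR objective above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `mmr_select`, its parameters, the default `lam=0.5`, and the choice to apply no similarity penalty while the selected set is still empty are all assumptions made for the example. The importance weights `weights` are taken as given; how they are produced is outside the scope of this sketch.

```python
import numpy as np

def mmr_select(tokens, weights, k, lam=0.5, eps=1e-6):
    """Greedy MMR selection (illustrative sketch); returns indices of k kept tokens."""
    x = np.asarray(tokens, dtype=np.float64)   # shape (n, d): one row per visual token
    w = np.asarray(weights, dtype=np.float64)  # shape (n,): raw importance scores
    n = x.shape[0]
    k = min(k, n)

    # Min-max normalised importance, as in the Imp(v_i) formula.
    imp = (w - w.min()) / (w.max() - w.min() + eps)

    # Cosine similarity matrix, as in the Sim(v_i, v_j) formula.
    unit = x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)
    sim = unit @ unit.T

    selected = []
    available = np.ones(n, dtype=bool)
    max_sim = np.full(n, -np.inf)  # max similarity of each token to the selected set

    for _ in range(k):
        # Assumption: with an empty selected set there is nothing to penalise.
        penalty = max_sim if selected else np.zeros(n)
        score = lam * imp - (1.0 - lam) * penalty
        score[~available] = -np.inf            # never re-select a token
        best = int(np.argmax(score))
        selected.append(best)
        available[best] = False
        max_sim = np.maximum(max_sim, sim[best])
    return selected

# Toy example: tokens 0 and 1 are near-duplicates, token 2 is orthogonal to both.
toy_tokens = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0]]
toy_weights = [1.0, 0.9, 0.5]
balanced = mmr_select(toy_tokens, toy_weights, k=2, lam=0.5)
importance_only = mmr_select(toy_tokens, toy_weights, k=2, lam=1.0)
```

In the toy example, the balanced setting skips the near-duplicate token in favour of the less important but dissimilar one, whereas `lam=1.0` reduces to plain top-k selection by importance.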

Abstract

Multimodal Large Language Models (MLLMs) have demonstrated impressive capabilities, yet they encounter significant computational bottlenecks due to the massive volume of visual tokens. Consequently, visual token pruning, which substantially reduces the token count, has emerged as a critical technique for accelerating MLLM inference. Existing approaches focus on token importance, diversity, or an intuitive combination of both, without a principled framework for their optimal integration. To address this issue, we first conduct a systematic analysis to characterize the trade-off between token importance and semantic diversity. Guided by this analysis, we propose the Importance and Diversity Pruner (IDPruner), which leverages the Maximal Marginal Relevance (MMR) algorithm to achieve a Pareto-optimal balance between these two objectives. Crucially, our method operates without requiring attention maps, ensuring full compatibility with FlashAttention and efficient deployment via one-shot pruning. We conduct extensive experiments across various model architectures and multimodal benchmarks, demonstrating that IDPruner achieves state-of-the-art performance and superior generalization across diverse architectures and tasks. Notably, on Qwen2.5-VL-7B-Instruct, IDPruner retains 95.18% of baseline performance when pruning 75% of the tokens, and still maintains 86.40% even under an extreme 90% pruning ratio. Our code is available at https://github.com/Tencent/AngelSlim.

Paper Structure

This paper contains 25 sections, 6 equations, 5 figures, 7 tables, 1 algorithm.

Figures (5)

  • Figure 1: Performance comparison across four architectures and eight benchmarks. IDPruner (outermost boundary) consistently outperforms baselines in both (a) aggregated performance across four diverse MLLM architectures and (b) fine-grained benchmark breakdown for Qwen2.5-VL. This demonstrates the superior cross-architecture generalization and task-specific robustness of our method.
  • Figure 2: Overview of the IDPruner framework. Left: Integration of our one-shot visual token pruning into the MLLM inference pipeline. Right: The core mechanism computes Importance Scores (Red) and a Similarity Matrix (Blue), utilizing an MMR selection process to harmonize importance and diversity. This approach operates without attention maps and remains compatible with FlashAttention.
  • Figure 3: Pareto Frontier Analysis. We visualize the trade-off between the Hopkins Statistic ($H$) and the Importance Retention Ratio ($\mathcal{I}$). The ideal pruning strategy should approach the top-left corner, achieving a high Importance Retention Ratio ($\mathcal{I} \to 1$) while minimizing the Hopkins Statistic ($H \to 0$). The MMR mechanism (Orange) constructs a superior Pareto frontier that strictly dominates the Naive Hybrid strategy (Purple) and envelops the DPP solution (Green).
  • Figure 4: Visualization of retained visual tokens across different samples from MMBench. Columns from left to right: Original Image, DivPrune, VisionSelector, and IDPruner. DivPrune maintains global coverage but often neglects the semantic subject. VisionSelector clusters heavily on salient objects, resulting in redundancy and background loss. IDPruner achieves a superior balance, preserving intricate details of the subject while maintaining essential background context for global reasoning.
  • Figure 5: Distribution of pairwise angles between visual tokens. We calculated the angles for all token pairs across 100 images from MMBench using Qwen2.5-VL-7B. The distribution is entirely concentrated within the acute angle range ($< 90^\circ$), peaking around $74^\circ$. The absence of obtuse angles ($> 90^\circ$, right of the red dashed line) guarantees that the cosine similarity metric remains strictly non-negative.
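
The pairwise-angle measurement described for Figure 5 can be reproduced in outline as follows. This is a hedged sketch rather than the authors' analysis script: the helper name `pairwise_angles_deg` is hypothetical, and the input is assumed to be a plain array of token embeddings with one row per token. Extracting those embeddings from Qwen2.5-VL-7B is not shown.

```python
import numpy as np

def pairwise_angles_deg(tokens, eps=1e-12):
    """Angles in degrees between every unordered pair of token vectors."""
    x = np.asarray(tokens, dtype=np.float64)
    unit = x / (np.linalg.norm(x, axis=1, keepdims=True) + eps)
    # Clip guards against values slightly outside [-1, 1] from rounding.
    cos = np.clip(unit @ unit.T, -1.0, 1.0)
    i, j = np.triu_indices(len(x), k=1)        # each pair (i < j) exactly once
    return np.degrees(np.arccos(cos[i, j]))

# Toy check with three 2-D vectors; pairs are ordered (0,1), (0,2), (1,2).
toy_angles = pairwise_angles_deg([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])

# All angles at or below 90 degrees means every cosine similarity is non-negative.
all_non_obtuse = bool(np.all(toy_angles <= 90.0))
```

A histogram of the returned angles (for example with `np.histogram`) gives the kind of distribution shown in the figure. An angle below $90^\circ$ corresponds to a positive cosine similarity, which is the property the caption relies on.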

Theorems & Definitions (3)

  • Definition 1: Visual Token Pruning
  • Definition 2: Importance Retention Ratio
  • Definition 3: Diversity Metric via Hopkins Statistic