EntropyPrune: Matrix Entropy Guided Visual Token Pruning for Multimodal Large Language Models

Yahong Wang, Juncheng Wu, Zhangkai Ni, Chengmei Yang, Yihang Liu, Longzhen Yang, Yuyin Zhou, Ying Wen, Lianghua He

TL;DR

This work proposes EntropyPrune, a matrix-entropy-guided token pruning framework that quantifies the information value of individual visual tokens and prunes redundant ones without relying on attention maps. It exploits the spectral equivalence of dual Gram matrices to make the entropy computation efficient.
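
The spectral equivalence mentioned above can be illustrated with a small NumPy sketch. It rests on a standard linear-algebra fact: for a matrix X of shape (n, d), the Gram matrices X Xᵀ (n×n) and XᵀX (d×d) share the same nonzero eigenvalues, so any spectrum-based quantity can be computed from the smaller one. The entropy definition used here (Shannon entropy of the trace-normalized eigenvalues) is an assumption, since this page does not give the paper's exact formula, and the function names `spectral_entropy` and `matrix_entropy` are hypothetical.

```python
import numpy as np


def spectral_entropy(gram: np.ndarray, eps: float = 1e-12) -> float:
    """Shannon entropy of the trace-normalized eigenvalues of a PSD matrix.

    Assumed definition for illustration; the paper's formula may differ.
    """
    eigvals = np.linalg.eigvalsh(gram)       # symmetric input -> real eigenvalues
    eigvals = np.clip(eigvals, 0.0, None)    # remove tiny negative round-off
    probs = eigvals / eigvals.sum()          # normalize so the spectrum sums to 1
    probs = probs[probs > eps]               # zero eigenvalues contribute nothing
    return float(-(probs * np.log(probs)).sum())


def matrix_entropy(x: np.ndarray) -> float:
    """Entropy of x's spectrum, computed from the smaller of the two Gram matrices."""
    n, d = x.shape
    gram = x @ x.T if n <= d else x.T @ x
    return spectral_entropy(gram)


rng = np.random.default_rng(0)
X = rng.standard_normal((16, 128))           # toy data, not from the paper

h_small = spectral_entropy(X @ X.T)          # 16 x 16 eigenproblem
h_large = spectral_entropy(X.T @ X)          # 128 x 128 eigenproblem
print(abs(h_small - h_large))                # difference is at round-off level
```

In this toy case the eigenproblem shrinks from 128×128 to 16×16 while the entropy is unchanged up to floating-point error.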

Abstract

Multimodal large language models (MLLMs) incur substantial inference cost due to the processing of hundreds of visual tokens per image. Although token pruning has proven effective for accelerating inference, determining when and where to prune remains largely heuristic. Existing approaches typically rely on static, empirically selected layers, which limits interpretability and transferability across models. In this work, we introduce a matrix-entropy perspective and identify an "Entropy Collapse Layer" (ECL), where the information content of visual representations exhibits a sharp and consistent drop, providing a principled criterion for selecting the pruning stage. Building on this observation, we propose EntropyPrune, a novel matrix-entropy-guided token pruning framework that quantifies the information value of individual visual tokens and prunes redundant ones without relying on attention maps. Moreover, to enable efficient computation, we exploit the spectral equivalence of dual Gram matrices, reducing the complexity of entropy computation and yielding up to a 64× theoretical speedup. Extensive experiments on diverse multimodal benchmarks demonstrate that EntropyPrune consistently outperforms state-of-the-art pruning methods in both accuracy and efficiency. On LLaVA-1.5-7B, our method achieves a 68.2% reduction in FLOPs while preserving 96.0% of the original performance. Furthermore, EntropyPrune generalizes effectively to high-resolution and video-based models, highlighting its robustness and scalability for practical MLLM acceleration. The code will be publicly available at https://github.com/YahongWang1/EntropyPrune.
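
As a rough illustration of how per-token entropy scoring and pruning could fit together, the sketch below reshapes each token's hidden vector head-wise into a (num_heads, head_dim) matrix, scores it by the entropy of that matrix's spectrum, and keeps the highest-scoring tokens. Everything here is an assumption for illustration and not the authors' implementation: the entropy definition (Shannon entropy of trace-normalized eigenvalues), the use of an uncentered Gram matrix, the top-k selection rule, and the names `token_entropies` and `prune_low_entropy_tokens`. The smaller num_heads × num_heads Gram matrix is used because it shares its nonzero eigenvalues with the head_dim × head_dim one. With 32 heads of dimension 128, a cubic-cost eigendecomposition would be (128/32)³ = 64 times cheaper on the smaller matrix, which may be the origin of the 64× figure, although the abstract does not say how that number is derived.

```python
import numpy as np


def token_entropies(hidden: np.ndarray, num_heads: int, eps: float = 1e-12) -> np.ndarray:
    """Score each token by the spectral entropy of its head-wise reshaped vector."""
    n_tokens, hidden_dim = hidden.shape
    if hidden_dim % num_heads != 0:
        raise ValueError("hidden_dim must be divisible by num_heads")
    z = hidden.reshape(n_tokens, num_heads, hidden_dim // num_heads)
    gram = z @ z.transpose(0, 2, 1)                  # (n_tokens, heads, heads)
    eig = np.clip(np.linalg.eigvalsh(gram), 0.0, None)
    total = np.maximum(eig.sum(axis=-1, keepdims=True), eps)
    probs = eig / total
    safe = np.where(probs > eps, probs, 1.0)         # log(1) = 0 for negligible terms
    return -(probs * np.log(safe)).sum(axis=-1)


def prune_low_entropy_tokens(hidden: np.ndarray, num_heads: int, keep: int):
    """Keep the `keep` highest-entropy tokens, preserving their original order."""
    if not 1 <= keep <= hidden.shape[0]:
        raise ValueError("keep must be between 1 and the number of tokens")
    scores = token_entropies(hidden, num_heads)
    keep_idx = np.sort(np.argsort(scores)[-keep:])
    return hidden[keep_idx], keep_idx


# Toy data (not from the paper): tokens 0 and 2 repeat one vector across all
# heads, so their reshaped matrices have rank one and near-zero entropy.
rng = np.random.default_rng(0)
num_heads, head_dim = 8, 16
rich = rng.standard_normal((3, num_heads * head_dim))
flat = np.tile(rng.standard_normal((2, head_dim)), (1, num_heads))
hidden = np.stack([flat[0], rich[0], flat[1], rich[1], rich[2]])

kept, kept_idx = prune_low_entropy_tokens(hidden, num_heads, keep=3)
print(kept_idx)
```

The two rank-one tokens are the ones dropped, since a vector that is identical across heads carries a single spectral component.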

Paper Structure

This paper contains 32 sections, 35 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: (a) Comparison between vanilla LLaVA-1.5-7B and EntropyPrune. Correct answers are highlighted in green, while hallucinations are marked in red. By removing low-information tokens, EntropyPrune encourages the model to concentrate on more critical details (e.g., the person's state and the car's color). (b) Performance comparison. The radial-axis visualization of min-max normalized scores shows that EntropyPrune consistently outperforms state-of-the-art pruning methods, including FastV, DART, DivPrune, and CDPruner.
  • Figure 2: Layer-wise matrix entropy of visual tokens (query and key states) in LLaVA-1.5-7B and LLaVA-Next-7B across eight datasets. A consistent layer-wise trend is observed across different datasets, with a precipitous entropy drop after the second layer.
  • Figure 3: Eigenvalue distribution of the query- and key-state covariance matrices. L$i$ denotes the $i$-th layer. We visualize the magnitude of eigenvalues across different layers of LLaVA-1.5-7B. The rapid decay observed in the eigenvalue distribution indicates a low-rank structure within the matrices.
  • Figure 4: Overview of the proposed EntropyPrune. (a) When to Prune identifies the "Entropy Collapse Layer" by detecting a sharp drop in layer-wise matrix entropy. (b) What to Prune details the head-wise reshaping mechanism and the calculation of matrix entropy based on the token's covariance matrix. (c) EntropyPrune Pipeline demonstrates the overall workflow where the matrix entropy of each visual token is calculated after the "Entropy Collapse Layer" to prune low-entropy tokens.
  • Figure 5: Ablation study on pruning layer selection. Experiments are conducted on TextVQA and MMB when retaining 192 tokens. Applying pruning at the Entropy Collapse Layer (Layer 2) consistently yields the best performance across all baselines.
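
The captions for Figures 2 and 4 describe locating the Entropy Collapse Layer by detecting a sharp drop in layer-wise matrix entropy. A minimal sketch of one way to do that is shown below, assuming the layer is chosen as the one followed by the largest single-step entropy decrease. That selection rule, the 1-based layer numbering, the function name `find_entropy_collapse_layer`, and the entropy values are all hypothetical; the values are made up for illustration and are not measurements from the paper.

```python
from typing import Sequence


def find_entropy_collapse_layer(layer_entropies: Sequence[float]) -> int:
    """Return the 1-based layer after which entropy drops the most.

    Assumed rule for illustration: largest decrease between consecutive layers.
    """
    if len(layer_entropies) < 2:
        raise ValueError("need entropies for at least two layers")
    drops = [
        layer_entropies[i] - layer_entropies[i + 1]
        for i in range(len(layer_entropies) - 1)
    ]
    best = max(range(len(drops)), key=drops.__getitem__)
    return best + 1  # convert 0-based index to 1-based layer number


# Made-up layer-wise entropies with a sharp drop after the second layer.
toy_entropies = [3.9, 3.8, 2.1, 2.0, 1.9]
ecl = find_entropy_collapse_layer(toy_entropies)
print(ecl)
```

In practice the per-layer entropies would be averaged over a set of images before applying such a rule, so that a single unusual input does not decide the pruning layer.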