ReDiPrune: Relevance-Diversity Pre-Projection Token Pruning for Efficient Multimodal LLMs

An Yu, Ting Yu Tsai, Zhenfei Zhang, Weiheng Lu, Felix X.-F. Ye, Ming-Ching Chang

Abstract

Recent multimodal large language models are computationally expensive because Transformers must process a large number of visual tokens. We present ReDiPrune, a training-free token pruning method applied before the vision-language projector, where visual features remain rich and discriminative. Unlike post-projection pruning methods that operate on compressed representations, ReDiPrune selects informative tokens directly from vision encoder outputs, preserving fine-grained spatial and semantic cues. Each token is scored by a lightweight rule that jointly considers text-conditioned relevance and max-min diversity, ensuring the selected tokens are both query-relevant and non-redundant. ReDiPrune is fully plug-and-play, requiring no retraining or architectural modifications, and can be seamlessly inserted between the encoder and projector. Across four video and five image benchmarks, it consistently improves the accuracy-efficiency trade-off. For example, on EgoSchema with LLaVA-NeXT-Video-7B, retaining only 15% of visual tokens yields a +2.0% absolute accuracy gain while reducing computation by more than $6\times$ in TFLOPs. Code is available at https://github.com/UA-CVML/ReDiPrune.
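To make the pre-projection placement concrete, the sketch below shows where a method like ReDiPrune slots into a standard MLLM forward pass: pruning runs on raw vision-encoder tokens, and only the survivors pass through the projector and into the LLM. This is a minimal illustration under assumed interfaces, not the authors' implementation; the module names, tensor shapes, and the `prune_fn` signature are all assumptions.

```python
import torch

def mllm_forward_with_pruning(vision_encoder, projector, llm,
                              frames, text_embeds, prune_fn, keep_ratio=0.15):
    """Illustrative MLLM forward pass showing the pre-projection insertion point.

    `prune_fn` runs on raw vision-encoder tokens, i.e. BEFORE the
    vision-language projector, so selection sees the original, unprojected
    visual features. Module names and shapes are assumptions for
    illustration, not the authors' API.
    """
    visual_tokens = vision_encoder(frames)          # (N, L, d_v) raw tokens per frame
    L = visual_tokens.shape[1]
    k = max(1, int(keep_ratio * L))                 # per-frame token budget (e.g. 15%)
    kept = prune_fn(visual_tokens, k)               # (N, k, d_v): prune pre-projection
    z_v = projector(kept)                           # (N, k, d_llm): project survivors only
    inputs = torch.cat([z_v.flatten(0, 1), text_embeds], dim=0)
    return llm(inputs_embeds=inputs.unsqueeze(0))   # decode over pruned visual + text tokens
```

Because pruning happens before projection, the projector and the LLM only ever see $k$ tokens per frame, which is where the reported TFLOPs reduction comes from.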

Figures (8)

  • Figure 1: Comparison of pruning strategies in MLLMs: (a) Post-projection pruning selects diverse tokens but ignores text and loses fine details. (b) Text-guided post-projection pruning improves query relevance but discards original visual cues. (c) ReDiPrune prunes tokens before projection using weighted query embeddings, preserving details while balancing accuracy and efficiency.
  • Figure 2: Overview of ReDiPrune: Given a prompt $p$, we build a normalized weighted query vector $\hat{q}$ ($\S$\ref{sec:query_embedding}). For each frame $I^{(n)}$, the vision encoder produces tokens $E_V^{(n)}$. ReDiPrune operates before the projector $P(\cdot)$ in the visual feature space, computing text relevance $s_i$ and cosine dissimilarity $D_{ij}$ ($\S$\ref{sec:scoring}). With a per-frame budget $k_n$, we greedily select $k_n$ tokens by maximizing a relevance-diversity score $\min_{u\in S} D_{ui} + \alpha s_i$ ($\S$\ref{sec:greedy_solver}), where $u$ indexes tokens already selected in $S$ (a minimal code sketch of this greedy rule appears after this figure list). The retained tokens $\tilde{E}_V^{(n)}$ are projected as $Z_V^{(n)}$ and concatenated with text embeddings for decoding by the LLM $f_\theta$, reducing redundancy while preserving query-relevant information.
  • Figure 3: Qualitative examples from the TGIF dataset [Jang2017TGIF] using Video-LLaVA-7B [videollava]. For each question, we show the ground-truth (GT) answer and responses from the original model, DivPrune [Alvar2025divprune], CDPruner [zhang2025CDPruner], and ReDiPrune. ReDiPrune accurately captures action cues, demonstrating stronger semantic grounding and temporal understanding compared with competing pruning methods.
  • Figure 4: Sharper, query-aligned attention with ReDiPrune. Frame-wise attention visualization and distribution for different pruning methods on a NExT-QA [xiao2021next] sample. (Left) Each method (Original, DivPrune [Alvar2025divprune], CDPruner [zhang2025CDPruner], and ReDiPrune) processes the same query; eight frames (f0-f7) and their attention scores (in $\times 10^{-3}$) are shown in descending order of importance. (Right) Histogram of attention across frames. ReDiPrune produces sharper, query-focused attention on the most relevant frames and a more concentrated distribution, indicating stronger alignment with the query.
  • ...and 3 more figures
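As a concrete reading of the selection rule summarized in the Figure 2 caption, here is a minimal sketch of the greedy relevance-diversity solver for a single frame. It assumes the encoder tokens and the weighted query vector $\hat{q}$ are comparable after L2 normalization; seeding with the most relevant token and the exact scoring details are illustrative assumptions, not the paper's verified procedure.

```python
import torch
import torch.nn.functional as F

def redi_prune_sketch(tokens: torch.Tensor, q_hat: torch.Tensor, k: int,
                      alpha: float = 1.0) -> torch.Tensor:
    """Greedy relevance-diversity token selection (sketch of Fig. 2's rule).

    tokens: (L, d) vision-encoder outputs for one frame.
    q_hat:  (d,) normalized weighted query vector (its construction from the
            prompt is described in the paper's query-embedding section and is
            assumed here). Returns indices of the k retained tokens.
    """
    x = F.normalize(tokens, dim=-1)                   # unit-norm tokens
    s = x @ q_hat                                     # text relevance s_i (cosine)
    D = 1.0 - x @ x.T                                 # pairwise cosine dissimilarity D_ij

    selected = [int(torch.argmax(s))]                 # seed with the most relevant token
    min_dissim = D[selected[0]].clone()               # running min_{u in S} D_{u,i}
    for _ in range(k - 1):
        score = min_dissim + alpha * s                # relevance-diversity objective
        score[selected] = float("-inf")               # never re-pick a selected token
        i = int(torch.argmax(score))                  # greedy argmax of the score
        selected.append(i)
        min_dissim = torch.minimum(min_dissim, D[i])  # update running minimum
    return torch.tensor(selected)
```

With the running minimum cached, each greedy step costs O(L), so selecting $k_n$ tokens adds O($k_n L$) on top of the one-off O($L^2 d$) dissimilarity matrix; the $\alpha$ term trades pure max-min dispersion against query relevance.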