ToDRE: Effective Visual Token Pruning via Token Diversity and Task Relevance

Duo Li, Zuhao Yang, Xiaoqin Zhang, Ling Shao, Shijian Lu

TL;DR

ToDRE tackles the heavy cost of visual token processing in LVLMs by identifying two orthogonal sources of redundancy: intra-modal visual token diversity and cross-modal task relevance. It proposes a two-stage, training-free framework that first preserves a diverse subset of visual tokens via greedy max-sum diversification and then prunes remaining tokens at a late decoder layer guided by cross-modal attention, leveraging information migration in the LLM. The approach yields substantial efficiency gains (up to 2.6x speed-up and up to 90% token pruning) while maintaining around 95% of the original performance across image and video benchmarks and showing strong transferability across backbones and tasks. Overall, ToDRE offers a practical, model-agnostic path to lighter, faster LVLM inference with minimal accuracy loss, supported by theoretical orthogonality of redundancy and extensive empirical validation.

Abstract

Visual token pruning, which compresses and removes redundant visual tokens, plays a critical role in efficient inference with large vision-language models (LVLMs). However, most existing work estimates visual redundancy using a single metric, such as cross-modal attention or visual token similarity. We show that visual token diversity and task-specific token relevance are two crucial yet orthogonal factors that complement each other in conveying useful information and should therefore be treated separately for more effective visual token pruning. Building upon this insight, we design ToDRE, a two-stage, training-free framework that incorporates Token Diversity and task RElevance for effective token compression and efficient LVLM inference. Instead of pruning redundant tokens, we introduce a greedy max-sum diversification algorithm that selects and retains a subset of diverse and representative visual tokens after the vision encoder. On top of that, ToDRE leverages an "information migration" mechanism to eliminate task-irrelevant visual tokens within certain decoder layers of the large language model (LLM), further improving token pruning and LVLM inference. Extensive experiments show that ToDRE prunes 90% of visual tokens after the vision encoder as well as all visual tokens in certain LLM decoder layers, leading to a 2.6x speed-up in total inference time while maintaining 95.0% of model performance and offering excellent model compatibility.
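The diversity-driven stage can be illustrated with a minimal sketch of greedy max-sum diversification. The abstract does not specify the distance measure or the data layout, so the cosine distance, the list-of-vectors token format, and the names `cosine_distance`, `greedy_max_sum_select`, and `pivot` are assumptions made for illustration, not the authors' implementation.

```python
import math


def cosine_distance(a, b):
    """Return 1 - cosine similarity (assumed dissimilarity measure)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 1.0  # treat a zero vector as unrelated to everything
    return 1.0 - dot / (norm_a * norm_b)


def greedy_max_sum_select(tokens, k, pivot=0):
    """Greedily pick k token indices that maximise summed pairwise distance.

    tokens: list of equal-length feature vectors (hypothetical layout).
    pivot:  index of the starting token; how it is chosen is outside
            the scope of this sketch.
    """
    n = len(tokens)
    k = min(k, n)
    if k <= 0:
        return []
    if not 0 <= pivot < n:
        raise ValueError("pivot index out of range")

    selected = [pivot]
    chosen = {pivot}
    # gain[i] holds the sum of distances from token i to all selected tokens.
    gain = [cosine_distance(tokens[i], tokens[pivot]) for i in range(n)]

    while len(selected) < k:
        # Add the candidate that increases the total pairwise distance most.
        best = max((i for i in range(n) if i not in chosen),
                   key=lambda i: gain[i])
        selected.append(best)
        chosen.add(best)
        # Update gains incrementally instead of recomputing all sums.
        for i in range(n):
            gain[i] += cosine_distance(tokens[i], tokens[best])
    return selected


# Two identical tokens and one orthogonal token: the duplicate is skipped.
print(greedy_max_sum_select([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]], k=2))  # [0, 2]
```

Keeping a running `gain` per token makes each greedy step linear in the number of tokens, so selecting k of n tokens costs O(nk) distance evaluations in this sketch.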

Paper Structure

This paper contains 55 sections, 1 theorem, 19 equations, 6 figures, 8 tables, 1 algorithm.

Key Result

Lemma 1

If $\mathcal{V}\perp\mathcal{T}$, then for any $v_i,v_j\in\mathcal{V}$ and $v_k\in\mathcal{V}$,

Figures (6)

  • Figure 1: (a–c): Unlike prevalent visual token pruning approaches [chen2024fastv, zhang2024fastervlm] that rely heavily on attention scores, the proposed ToDRE incorporates token diversity and task relevance, two largely neglected yet critical factors that help preserve indispensable and informative visual cues and improve pruning robustness and answer accuracy, as illustrated in the coffee cup localization task. (d): Quantitative experiments over eight image-language comprehension benchmarks demonstrate the superior and consistent effectiveness of the proposed ToDRE.
  • Figure 1: Output token’s attention toward different input token types across LLM layers during decoding. Results are averaged over 100 samples per benchmark.
  • Figure 2: Text-to-visual attention (blue) and visual-to-text attention (orange) in each LLM decoder layer. We observe a clear pattern of "information migration": cross-modal attention (both visual-to-text and text-to-visual) is high in early layers, reflecting active information exchange, but gradually diminishes in deeper layers as the model shifts toward unimodal text reasoning.
  • Figure 2: Qualitative comparison of free-form video-grounded QA on the Video Detail Caption benchmark [chai2025videodetailcaption]. Green text highlights correctly identified events and objects; red text indicates incorrect predictions; yellow text marks missing but essential information.
  • Figure 3: Overall framework of ToDRE. Given the visual and textual inputs, the proposed Diversity-driven Token Selection first selects a pivot token from global thumbnail or video frames with [CLS]-based attention and then performs max-sum diversification to retain a diverse set of $k$ visual tokens. The proposed Relevance-driven Token Reduction then dynamically identifies a pivot decoder layer and prunes all its visual tokens—the layer is identified if its visual-to-text and text-to-visual attention ratios both fall below a predefined threshold $\tau$. $E^G_v$, $E^C_v$, and $E^F_v$ denote the embeddings of thumbnail, local crops, and video frames, respectively.
  • ...and 1 more figure
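The pivot-layer rule described in the Figure 3 caption can be sketched as follows. The caption states only that a layer is identified when its visual-to-text and text-to-visual attention ratios both fall below the threshold $\tau$; how those ratios are computed is not given here, so this sketch takes them as precomputed per-layer inputs. Returning the first qualifying layer, and the names `find_pivot_layer`, `v2t_ratio`, and `t2v_ratio`, are assumptions for illustration.

```python
from typing import Optional, Sequence


def find_pivot_layer(v2t_ratio: Sequence[float],
                     t2v_ratio: Sequence[float],
                     tau: float) -> Optional[int]:
    """Return the index of the first decoder layer whose visual-to-text
    and text-to-visual attention ratios are both below tau.

    Returns None when no layer satisfies the condition, in which case
    no relevance-driven pruning would be applied in this sketch.
    """
    if len(v2t_ratio) != len(t2v_ratio):
        raise ValueError("both ratio sequences need one entry per layer")
    for layer, (v2t, t2v) in enumerate(zip(v2t_ratio, t2v_ratio)):
        if v2t < tau and t2v < tau:
            return layer
    return None


# Hypothetical per-layer ratios that decay with depth.
v2t = [0.50, 0.20, 0.05, 0.01]
t2v = [0.40, 0.05, 0.04, 0.01]
print(find_pivot_layer(v2t, t2v, tau=0.1))  # 2
```

Layer 1 is not selected in the example because only one of its two ratios is below the threshold; the rule requires both.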

Theorems & Definitions (1)

  • Lemma 1: Sub-space independence