ToDRE: Effective Visual Token Pruning via Token Diversity and Task Relevance
Duo Li, Zuhao Yang, Xiaoqin Zhang, Ling Shao, Shijian Lu
TL;DR
ToDRE tackles the heavy cost of visual token processing in LVLMs by identifying two orthogonal sources of redundancy: intra-modal visual token diversity and cross-modal task relevance. It proposes a two-stage, training-free framework that first preserves a diverse subset of visual tokens via greedy max-sum diversification and then prunes remaining tokens at a late decoder layer guided by cross-modal attention, leveraging information migration in the LLM. The approach yields substantial efficiency gains (up to 2.6x speed-up and up to 90% token pruning) while maintaining around 95% of the original performance across image and video benchmarks and showing strong transferability across backbones and tasks. Overall, ToDRE offers a practical, model-agnostic path to lighter, faster LVLM inference with minimal accuracy loss, supported by theoretical orthogonality of redundancy and extensive empirical validation.
Abstract
Visual token pruning compresses and removes redundant visual tokens, and it plays a critical role in efficient inference with large vision-language models (LVLMs). However, most existing work estimates visual redundancy using a single metric, such as cross-modal attention or visual token similarity. We show that visual token diversity and task-specific token relevance are two crucial yet orthogonal factors that complement each other in conveying useful information and should therefore be treated separately for more effective visual token pruning. Building upon this insight, we design ToDRE, a two-stage, training-free framework that incorporates Token Diversity and task RElevance for effective token compression and efficient LVLM inference. Instead of pruning redundant tokens, we introduce a greedy max-sum diversification algorithm that selects and retains a subset of diverse and representative visual tokens after the vision encoder. On top of that, ToDRE leverages an "information migration" mechanism to eliminate task-irrelevant visual tokens within certain decoder layers of the large language model (LLM), further improving token pruning and LVLM inference efficiency. Extensive experiments show that ToDRE prunes 90% of visual tokens after the vision encoder as well as all visual tokens in certain LLM decoder layers, leading to a 2.6x speed-up in total inference time while maintaining 95.0% of model performance and excellent model compatibility.
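To make the first stage concrete, the following is a minimal sketch of a greedy max-sum diversification step over visual token embeddings. It is an illustration under stated assumptions, not the authors' implementation: the function name `greedy_max_sum_diversify`, the use of cosine distance, and the seeding rule (start from the token with the largest total distance to all others) are all hypothetical choices that the abstract does not specify.

```python
import numpy as np


def greedy_max_sum_diversify(tokens: np.ndarray, k: int) -> np.ndarray:
    """Greedily pick k token indices with large summed pairwise cosine distance.

    tokens: (n, d) array of visual token embeddings (assumed layout).
    Returns the selected indices in ascending order.
    """
    n = tokens.shape[0]
    if not 0 < k <= n:
        raise ValueError("k must satisfy 1 <= k <= number of tokens")

    # Cosine distance matrix: 1 - cosine similarity of L2-normalized rows.
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    normed = tokens / np.clip(norms, 1e-12, None)
    dist = 1.0 - normed @ normed.T
    np.fill_diagonal(dist, 0.0)

    # Seeding rule (an assumption): start from the token that is farthest
    # from all other tokens in aggregate.
    first = int(np.argmax(dist.sum(axis=1)))
    selected = [first]
    chosen = np.zeros(n, dtype=bool)
    chosen[first] = True

    # gain[i] holds the summed distance from token i to all selected tokens.
    gain = dist[first].copy()

    while len(selected) < k:
        # Exclude already-selected tokens, then take the largest marginal gain.
        candidate_gain = np.where(chosen, -np.inf, gain)
        nxt = int(np.argmax(candidate_gain))
        selected.append(nxt)
        chosen[nxt] = True
        gain += dist[nxt]

    return np.sort(np.array(selected))
```

Under these assumptions, retaining roughly 10% of the tokens (that is, pruning about 90%) would amount to calling `greedy_max_sum_diversify(visual_tokens, k)` with `k` set to about a tenth of the token count, and then indexing `visual_tokens[keep]` before passing the result to the LLM. Each greedy step updates the running gain in O(n), so selection costs O(nk) once the O(n^2 d) distance matrix is built.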
