All You Need Are Random Visual Tokens? Demystifying Token Pruning in VLLMs

Yahong Wang, Juncheng Wu, Zhangkai Ni, Longzhen Yang, Yihang Liu, Chengmei Yang, Ying Wen, Xianfeng Tang, Hui Liu, Yuyin Zhou, Lianghua He

TL;DR

This work examines why token pruning in Vision-Language Models often stalls at deep decoder layers, revealing an information horizon where visual tokens lose salience. By defining an information score that measures the impact of removing each token on the model's output, the authors show that token information progressively vanishes with depth and is modulated by task complexity and model capability. The key contribution is a dynamic, information-centric view of token pruning that justifies using random pruning in deep layers and demonstrates that combining random pruning with existing methods yields superior efficiency-accuracy tradeoffs. The results establish practical guidelines for pruning VLLMs and offer state-of-the-art improvements (e.g., DivPrune+Random) while ensuring substantial speedups, with code to be released publicly.

Abstract

Vision Large Language Models (VLLMs) incur high computational costs due to their reliance on hundreds of visual tokens to represent images. Token pruning offers a promising way to accelerate inference, but this paper identifies a key observation: in deeper layers (e.g., beyond the 20th), existing training-free pruning methods perform no better than random pruning. We hypothesize that this degradation is caused by "vanishing token information", where visual tokens progressively lose their salience with increasing network depth. To validate this hypothesis, we quantify a token's information content by measuring the change in the model's output probabilities upon its removal. Using this metric, our analysis of visual token information across layers reveals three key findings: (1) As layers deepen, the information of visual tokens gradually becomes uniform and eventually vanishes at an intermediate layer, which we term the "information horizon", beyond which the visual tokens become redundant; (2) The position of this horizon is not static; it extends deeper for visually intensive tasks, such as Optical Character Recognition (OCR), than for more general tasks like Visual Question Answering (VQA); (3) This horizon is also strongly correlated with model capacity, as stronger VLLMs (e.g., Qwen2.5-VL) use visual tokens at deeper layers than weaker models (e.g., LLaVA-1.5). Based on these findings, we show that simple random pruning in deep layers effectively balances performance and efficiency. Moreover, integrating random pruning consistently enhances existing methods. Combining DivPrune with random pruning achieves state-of-the-art results, maintaining 96.9% of Qwen-2.5-VL-7B performance while pruning 50% of visual tokens. The code will be publicly available at https://github.com/YahongWang1/Information-Horizon.
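
The random pruning of visual tokens mentioned in the abstract can be illustrated with a short sketch. This is not the authors' released code: the function names `random_prune_visual_tokens` and `prune_sequence`, the `keep_ratio` parameter, and the representation of a sequence as a list of token positions are all assumptions made for this example. In a real model, the returned indices would be used to select hidden states at the chosen decoder layer.

```python
import random


def random_prune_visual_tokens(visual_positions, keep_ratio, seed=None):
    """Return a sorted random subset of visual-token positions to keep.

    Hypothetical helper: `visual_positions` are the indices of visual tokens
    in the decoder input sequence, `keep_ratio` is the fraction to retain.
    """
    if not 0.0 <= keep_ratio <= 1.0:
        raise ValueError("keep_ratio must be in [0, 1]")
    positions = list(visual_positions)
    rng = random.Random(seed)
    n_keep = int(len(positions) * keep_ratio)
    # Sample uniformly without replacement, then restore sequence order.
    return sorted(rng.sample(positions, n_keep))


def prune_sequence(seq_len, visual_positions, keep_ratio, seed=None):
    """Indices retained after pruning: every non-visual token plus the
    randomly kept visual tokens, in their original order."""
    visual = set(visual_positions)
    kept_visual = set(random_prune_visual_tokens(visual, keep_ratio, seed))
    return [i for i in range(seq_len) if i not in visual or i in kept_visual]


# Toy usage: a 10-token sequence whose positions 2..7 are visual tokens.
kept = prune_sequence(10, range(2, 8), keep_ratio=0.5, seed=0)
print(len(kept))  # 4 text tokens + 3 of the 6 visual tokens
```

Text tokens are never dropped in this sketch; only the visual positions are sampled, which mirrors the setting the abstract describes, where a fixed fraction of visual tokens is pruned.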

Paper Structure

This paper contains 25 sections, 6 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Existing token pruning methods exhibit similar performance to random pruning at deeper layers. We compare various pruning methods on the LLaVA-1.5-7B model and three benchmarks, with 90% of visual tokens pruned at a given language decoder layer.
  • Figure 2: We compare various pruning methods on the Qwen-2.5-VL-7B model using the MME and TextVQA benchmarks. At each decoder layer, 87.5% of the visual tokens are removed.
  • Figure 3: Tasks with different visual complexity. Low visual complexity tasks only require the VLLM to identify global information such as the main scene, while tasks with higher visual complexity require the model to focus on visual details.
  • Figure 4: Illustration of our framework for computing visual token information. At the $i$-th layer of the VLLM's language decoder, we first remove all visual tokens except the target one and run a forward pass. We then run a second forward pass with the target token removed as well. The difference between the two output probabilities on the ground-truth label defines the information score $\text{I}_i(\mathbf{V}_k)$.
  • Figure 5: Evaluation of various pruning methods. We measure the sum of information in retained visual tokens when using different pruning methods. In the deep layers, existing pruning methods fail to retain more high-information tokens than random pruning.
  • ...and 7 more figures
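
The two-pass procedure in the Figure 4 caption can be sketched as follows. This is an illustration under stated assumptions, not the paper's implementation: `prob_fn` is a hypothetical callback standing in for a forward pass that returns the ground-truth-label probability when only the given visual tokens are kept from a given layer onward, `toy_prob_fn` is a made-up stand-in with arbitrary numbers, and the sign convention (with the target token minus without it) is assumed.

```python
from typing import Callable, Dict, FrozenSet, Iterable

# Hypothetical interface: (layer index, visual tokens kept) -> P(ground truth).
ProbFn = Callable[[int, FrozenSet[int]], float]


def information_score(prob_fn: ProbFn, layer: int, k: int) -> float:
    """Score of visual token k at `layer`, following the Figure 4 caption."""
    # Pass 1: every visual token except k is removed at this layer.
    p_with = prob_fn(layer, frozenset({k}))
    # Pass 2: token k is removed as well, so no visual tokens remain.
    p_without = prob_fn(layer, frozenset())
    return p_with - p_without


def score_all(prob_fn: ProbFn, layer: int, visual_ids: Iterable[int]) -> Dict[int, float]:
    """Information score for each visual token at one layer."""
    return {k: information_score(prob_fn, layer, k) for k in visual_ids}


def toy_prob_fn(layer: int, kept: FrozenSet[int]) -> float:
    """Toy stand-in for a VLLM forward pass (all numbers are invented).

    Each token adds a fixed gain to a base probability in early layers;
    from layer 20 onward (an arbitrary cutoff) visual tokens have no effect.
    """
    base = 0.25
    gain = {0: 0.5, 1: 0.125, 2: 0.0}
    if layer >= 20:
        return base
    return base + sum(gain[k] for k in kept)


print(score_all(toy_prob_fn, layer=5, visual_ids=[0, 1, 2]))
print(score_all(toy_prob_fn, layer=25, visual_ids=[0, 1, 2]))
```

With a real model, `prob_fn` would need to run the decoder up to the chosen layer with all tokens present and then drop the unwanted visual tokens for the remaining layers; each score costs two forward passes per token, so scoring every token at every layer is expensive.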