IWP: Token Pruning as Implicit Weight Pruning in Large Vision Language Models

Dong-Jae Lee, Sunghyun Baek, Junmo Kim

Abstract

Large Vision Language Models (LVLMs) show impressive performance across image and video understanding tasks, yet their computational cost grows rapidly with the number of visual tokens. Existing token pruning methods mitigate this issue through empirical heuristics while overlooking the internal mechanism of attention. In this paper, we propose a novel training-free token pruning framework grounded in the dual-form perspective of attention. We reformulate attention as an implicit linear layer whose weight matrix is the sum of rank-1 outer products, each generated by a single token's key-value pair. Token pruning thus reduces to selecting the subset of these rank-1 updates that best approximates the original dual weight matrix. Extending this perspective to standard softmax attention in LVLMs, we derive a novel metric quantifying both a token's information magnitude and its information duplication. To efficiently select a subset under the proposed metric, we introduce Progressive Chunked Maximal Marginal Relevance (MMR). Extensive experiments demonstrate that our method achieves a better trade-off between performance and efficiency, while providing a new perspective on existing pruning approaches.
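To make the dual-form reformulation concrete, here is a minimal NumPy sketch of kernelized (linear) attention written as an implicit weight matrix built from rank-1 updates $\Delta \mathbf{W}_i = \phi(\mathbf{k}_i)^\top \mathbf{v}_i$. The feature map `phi` (elu + 1), the normalizer, and the toy shapes are illustrative assumptions, and this sketch does not reproduce the paper's extension to standard softmax attention.

```python
import numpy as np

def phi(x):
    # Hypothetical kernel feature map, elu(x) + 1; the paper's actual
    # choice of phi is not specified in this sketch.
    return np.where(x > 0, x + 1.0, np.exp(x))

def dual_form_attention(q, K, V, keep=None):
    # Dual form of kernelized attention: the implicit weight matrix W is
    # the sum of rank-1 updates dW_i = outer(phi(k_i), v_i), one per token.
    # Passing `keep` prunes tokens, i.e. drops their rank-1 updates from W.
    idx = range(len(K)) if keep is None else keep
    W = np.zeros((K.shape[1], V.shape[1]))
    z = np.zeros(K.shape[1])              # accumulator for the normalizer
    for i in idx:
        fk = phi(K[i])
        W += np.outer(fk, V[i])           # rank-1 update dW_i
        z += fk
    fq = phi(q)
    return (fq @ W) / (fq @ z + 1e-8)     # scalar normalization eta_N(q)

# Toy check: pruning token 2 is exactly removing its update from W.
rng = np.random.default_rng(0)
K, V, q = rng.normal(size=(4, 8)), rng.normal(size=(4, 8)), rng.normal(size=8)
out_full = dual_form_attention(q, K, V)
out_pruned = dual_form_attention(q, K, V, keep=[0, 1, 3])
```

Under this view, token pruning is implicit weight pruning: choosing `keep` amounts to choosing which rank-1 updates to retain in the dual weight matrix.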

Figures (4)

  • Figure 1: Overview of the Dual-Form Token Pruning framework. Softmax attention is reinterpreted in its dual form via kernel mapping, where each token generates a rank-1 update $\Delta \mathbf{W}_i = \phi(\mathbf{k}_i)^\top \mathbf{v}_i$. The Progressive Chunked MMR loop filters tokens based on information magnitude and duplication to efficiently approximate the dual weights (an illustrative sketch of such a loop follows this figure list). For visual clarity, the scalar normalization term $\eta_N(\mathbf{q})$ is omitted.
  • Figure C.1: Token similarity visualization in LLaVA-OneVision-7B. We visualize the similarity between visual and text tokens extracted from Layer 4 across three metrics defined in Eq. 12. For visual clarity, diagonal elements are masked and we use the last 100 tokens of the entire sequence.
  • Figure C.2: Consistency of selected tokens across layers. We visualize the mean Intersection over Union (mIoU) of the visual token subsets selected by our pruning method at different transformer layers.
  • Figure C.3: Qualitative visualization of token pruning. We compare the generated descriptions from the unpruned baseline model with those from our proposed token pruning method under varying token budgets. Corresponding expressions are highlighted in the same color for clarity.
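As a rough illustration of the selection loop sketched in Figure 1, below is a hedged NumPy version of a progressive, chunked MMR pass over the rank-1 updates. The relevance term (the Frobenius norm $\|\Delta \mathbf{W}_i\|_F = \|\phi(\mathbf{k}_i)\|\,\|\mathbf{v}_i\|$), the duplication term (cosine similarity between updates, which factorizes for rank-1 matrices), and the hyperparameters `chunk_size` and `lam` are assumptions for illustration, not the paper's exact metric or schedule.

```python
import numpy as np

def progressive_chunked_mmr(K_feat, V, budget, chunk_size=64, lam=0.7):
    # Greedy MMR over rank-1 updates dW_i = outer(phi(k_i), v_i), processed
    # chunk by chunk with a proportional per-chunk quota. Relevance is the
    # update's Frobenius norm ||phi(k_i)|| * ||v_i|| (information magnitude,
    # rescaled to [0, 1]); redundancy is the cosine similarity between
    # updates, which for rank-1 matrices factorizes as
    # cos(phi(k_i), phi(k_j)) * cos(v_i, v_j).
    n = len(V)
    kn = np.linalg.norm(K_feat, axis=1, keepdims=True)
    vn = np.linalg.norm(V, axis=1, keepdims=True)
    Ku, Vu = K_feat / (kn + 1e-8), V / (vn + 1e-8)
    magnitude = (kn * vn).squeeze(1)
    magnitude = magnitude / (magnitude.max() + 1e-8)
    selected = []
    for start in range(0, n, chunk_size):
        chunk = list(range(start, min(start + chunk_size, n)))
        quota = round(budget * len(chunk) / n)   # proportional share
        for _ in range(quota):
            if not chunk:
                break
            best, best_score = None, -np.inf
            for i in chunk:
                # Duplication: worst-case similarity to anything kept so far.
                dup = max(((Ku[i] @ Ku[j]) * (Vu[i] @ Vu[j])
                           for j in selected), default=0.0)
                score = lam * magnitude[i] - (1 - lam) * dup
                if score > best_score:
                    best, best_score = i, score
            selected.append(best)
            chunk.remove(best)
    return sorted(selected)
```

For example, combined with the earlier sketch, `keep = progressive_chunked_mmr(phi(K), V, budget=2, chunk_size=2)` yields indices that can be passed straight to `dual_form_attention(q, K, V, keep=keep)`. Chunking bounds each greedy step to the current chunk plus the running selection rather than all tokens at once; whether this matches the paper's exact schedule is an assumption of this sketch.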