TransPrune: Token Transition Pruning for Efficient Large Vision-Language Model

Ao Li, Yuxiang Duan, Jinghui Zhang, Congbo Ma, Yutong Xie, Gustavo Carneiro, Mohammad Yaqub, Hu Wang

TL;DR

TransPrune introduces Token Transition Variation (TTV) and Instruction-Guided Attention (IGA) as complementary, training-free criteria for identifying important visual tokens in large vision-language models. By accumulating TTV across middle layers and summing it with IGA to form a single score, the method reduces inference TFLOPs by over 50% with little to no loss in multimodal performance across multiple LVLMs and benchmarks. The approach addresses limitations of attention-based token importance, such as positional bias, and is compatible with projector-based pruning methods like VisionZip and CDPruner. Empirically, TTV alone is also a strong signal, and combining it with IGA yields robust performance, making TransPrune a practical solution for accelerating LVLM inference. These findings suggest a direction for token pruning that uses token transitions to capture semantic importance, especially in the middle layers, where representations balance global and local information.

Abstract

Large Vision-Language Models (LVLMs) have advanced multimodal learning but face high computational costs due to the large number of visual tokens, motivating token pruning to improve inference efficiency. The key challenge lies in identifying which tokens are truly important. Most existing approaches rely on attention-based criteria to estimate token importance. However, these criteria suffer from inherent limitations such as positional bias. In this work, we explore a new perspective on token importance based on token transitions in LVLMs. We observe that the transition of token representations provides a meaningful signal of semantic information. Based on this insight, we propose TransPrune, a training-free and efficient token pruning method. Specifically, TransPrune progressively prunes tokens by assessing their importance through a combination of Token Transition Variation (TTV), which measures changes in both the magnitude and direction of token representations, and Instruction-Guided Attention (IGA), which measures how strongly the instruction tokens attend to image tokens. Extensive experiments demonstrate that TransPrune achieves multimodal performance comparable to that of the original LVLMs, such as LLaVA-v1.5 and LLaVA-Next, across eight benchmarks, while reducing inference TFLOPs by more than half. Moreover, TTV alone can serve as an effective criterion without relying on attention, achieving performance comparable to attention-based methods. Upon acceptance of the paper, the code will be made publicly available at https://github.com/liaolea/TransPrune.
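
The two raw signals named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the plain-list vector representation, and the `eps` guard are assumptions made for illustration. The magnitude change is taken as the ratio of output to input L2 norm and the direction change as cosine similarity, following the Figure 2 caption; how TransPrune turns these two quantities into a single TTV score is not specified here, so the sketch stops at the raw measurements.

```python
import math


def l2_norm(v):
    """Euclidean norm of a vector given as a list of floats."""
    return math.sqrt(sum(x * x for x in v))


def transition_measures(x_in, x_out, eps=1e-12):
    """Return (magnitude_ratio, cosine_similarity) for one token.

    x_in  : token representation entering a module (self-attention or FFN)
    x_out : the module's output for that token, without the residual branch
    eps   : hypothetical guard against division by zero (an assumption)
    """
    n_in, n_out = l2_norm(x_in), l2_norm(x_out)
    ratio = n_out / (n_in + eps)
    dot = sum(a * b for a, b in zip(x_in, x_out))
    cosine = dot / (n_in * n_out + eps)
    return ratio, cosine


def instruction_guided_attention(attn):
    """Average attention from instruction tokens to each image token.

    attn[i][j] is the attention weight from instruction token i to
    image token j; the result has one value per image token.
    """
    n_instr = len(attn)
    n_img = len(attn[0])
    return [sum(attn[i][j] for i in range(n_instr)) / n_instr
            for j in range(n_img)]


if __name__ == "__main__":
    print(transition_measures([3.0, 4.0], [0.0, 10.0]))
    print(instruction_guided_attention([[0.2, 0.8], [0.6, 0.4]]))
```

In a real LVLM these quantities would be computed on batched tensors inside the forward pass; plain lists are used here only to keep the sketch dependency-free.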

Paper Structure

This paper contains 12 sections, 7 equations, 5 figures, 13 tables.

Figures (5)

  • Figure 1: Comparison with existing pruning methods on LLaVA-v1.5-7B. Among within-LLM pruning approaches, TransPrune achieves the best performance across six benchmarks under the lowest TFLOPs budget.
  • Figure 2: Token Transition Visualization in LLaVA-v1.5-7B. We visualize the magnitude and direction changes of token representations within both the self-attention and FFN modules for each layer (excluding residual connections). To measure the magnitude change, we use the ratio of output to input L2 norm; to measure the directional change, we use cosine similarity. Token transitions that reflect semantic importance can be observed across shallow, middle, and deep layers, and they are most concentrated and pronounced in the middle layers (around layers 6–14), where tokens with larger ratios and smaller absolute cosine similarities tend to be more semantically important. We provide more visualization examples in the supplementary material.
  • Figure 3: (a) Overview of TransPrune. During pruning, TransPrune computes image token transitions. Tokens whose transitions are closer in magnitude to those of the original tokens, and that exhibit more orthogonal directional changes, are assigned higher TTV scores. In parallel, we compute IGA by averaging the attention from instruction tokens to image tokens. The final score for each token is obtained by summing TTV and IGA, followed by sorting. (b) Accumulation of TTV. To achieve a more precise TTV, we retain TTV scores from earlier layers. For each pruning stage, we accumulate TTV scores from the first accumulated layer up to the current pruning layer.
  • Figure 4: Token position frequency statistics on MME benchmark for IGA and TTV.
  • Figure 5: Visualization of TransPrune on different VQA prompts.
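
The scoring step described in the Figure 3 caption (accumulate TTV over layers, sum with IGA, sort) can be sketched as follows. This is an illustration under stated assumptions, not the paper's code: the function names, the dictionary layout of `ttv_per_layer`, and the `keep` budget are hypothetical. The sketch also assumes that all score lists refer to the same set of tokens and that TTV and IGA are already on comparable scales; any normalization applied before summing is not given in the captions.

```python
def accumulate_ttv(ttv_per_layer, first_layer, prune_layer):
    """Sum per-token TTV scores over layers first_layer..prune_layer (inclusive).

    ttv_per_layer maps a layer index to a list with one TTV score per token
    (hypothetical layout chosen for this sketch).
    """
    n_tokens = len(ttv_per_layer[first_layer])
    total = [0.0] * n_tokens
    for layer in range(first_layer, prune_layer + 1):
        for t, score in enumerate(ttv_per_layer[layer]):
            total[t] += score
    return total


def select_tokens(ttv, iga, keep):
    """Sum TTV and IGA per token, sort descending, and keep the top `keep`.

    Returns the kept token indices in their original order.
    """
    scores = [a + b for a, b in zip(ttv, iga)]
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:keep])


if __name__ == "__main__":
    # Toy scores for three image tokens over two layers (made-up numbers).
    ttv_per_layer = {6: [0.1, 0.5, 0.2], 7: [0.3, 0.1, 0.2]}
    ttv = accumulate_ttv(ttv_per_layer, first_layer=6, prune_layer=7)
    print(select_tokens(ttv, iga=[0.5, 0.0, 0.1], keep=2))
```

Returning the kept indices in their original order preserves the positional order of the surviving image tokens, which is a design choice of this sketch rather than something the captions state.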