What Kind of Visual Tokens Do We Need? Training-free Visual Token Pruning for Multi-modal Large Language Models from the Perspective of Graph

Yutao Jiang, Qiong Wu, Wenhao Lin, Wei Yu, Yiyi Zhou

TL;DR

This work addresses the high computational cost of multimodal LLMs arising from large visual token sequences. It introduces G-Prune, a training-free graph-based visual token pruning method that treats visual tokens as nodes connected by a cosine-similarity graph and uses iterative information propagation to identify representative tokens across both foreground and background regions. The approach substantially reduces FLOPs (by up to 63.57% in the reported cases) while largely maintaining or even improving performance across eight VL benchmarks, outperforming several existing pruning methods, and demonstrating robustness via ablations. These findings enable efficient high-resolution multimodal reasoning in MLLMs, reducing compute with only small accuracy drops on tasks such as TextVQA and other OCR-rich evaluations.

Abstract

Recent Multimodal Large Language Models (MLLMs) often use a large number of visual tokens to compensate for their visual shortcomings, leading to excessive computation and obvious visual redundancy. In this paper, we investigate what kind of visual tokens are needed for MLLMs, and reveal that both foreground and background tokens are critical for MLLMs given the varying difficulties of examples. Based on this observation, we propose a graph-based method for training-free visual token pruning, termed G-Prune. In particular, G-Prune regards visual tokens as nodes and constructs their connections based on their semantic similarities. Afterwards, information is propagated via weighted links, and the most important tokens after the iterations, which can be foreground or background ones, are kept for MLLMs. To validate G-Prune, we apply it to a recent MLLM called LLaVA-NeXT and conduct extensive experiments on a set of benchmarks. The experimental results show that G-Prune can greatly reduce computation overhead while retaining high performance on both coarse- and fine-grained tasks. For instance, G-Prune can reduce the FLOPs of LLaVA-NeXT by 63.57% on VQA2.0 and TextVQA with accuracy drops of only 0.95% and 2.34%, respectively.
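
The pipeline described in the abstract (tokens as nodes, similarity-weighted links, iterative propagation, keeping the highest-scoring tokens) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `cosine_similarity` and `prune_tokens`, the uniform initial scores, the clipping of negative similarities, and the row-normalised update rule are all assumptions made for illustration, since this summary does not give the exact propagation equations.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def prune_tokens(tokens, keep, iterations=3):
    """Return the sorted indices of the `keep` highest-scoring tokens.

    ASSUMPTION: the update rule (non-negative, row-normalised cosine weights
    and uniform initial scores) is illustrative, not the paper's formulation.
    """
    n = len(tokens)
    # Edge weights: cosine similarity between every pair of distinct tokens,
    # clipped at zero so the weights can be normalised into proportions.
    weights = [
        [max(cosine_similarity(tokens[i], tokens[j]), 0.0) if i != j else 0.0
         for j in range(n)]
        for i in range(n)
    ]
    # Each node splits its outgoing information in proportion to its weights.
    for row in weights:
        total = sum(row)
        if total > 0.0:
            row[:] = [w / total for w in row]

    scores = [1.0 / n] * n
    for _ in range(iterations):
        # Node j receives the score of every node i, scaled by the link i -> j.
        scores = [sum(scores[i] * weights[i][j] for i in range(n))
                  for j in range(n)]

    # Keep the top-`keep` tokens, reported in their original order.
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:keep])


# Toy example: three similar feature vectors and one dissimilar vector.
toy_tokens = [[1.0, 0.0], [0.9, 0.1], [0.8, 0.2], [0.0, 1.0]]
kept = prune_tokens(toy_tokens, keep=2)
```

In a real MLLM the feature vectors would be the visual encoder's token embeddings, and a tensor library would replace the nested loops. The quadratic pairwise similarity computation is kept explicit here only to make the graph structure visible.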

Paper Structure (13 sections, 6 equations, 8 figures, 3 tables, 1 algorithm)

This paper contains 13 sections, 6 equations, 8 figures, 3 tables, 1 algorithm.

Figures (8)

  • Figure 2: Overview of G-Prune. G-Prune aims to identify the important visual tokens for MLLMs, thereby reducing computational complexity. In practice, G-Prune regards all visual tokens as graph nodes and builds their connections based on their semantic similarities. Afterwards, an iterative algorithm propagates information among the nodes and updates the importance scores of the visual tokens. After the iterations, the top-$k$ tokens are selected for MLLMs; these can be both foreground and background ones.
  • Figure 3: Comparison between our G-Prune and other compression methods for the LLaVA-NeXT model tested on the TextVQA, DocVQA, POPE and GQA benchmarks.
  • Figure 4: Comparison between our G-Prune method and an $\ell_2$-norm-based pruning method. The results are based on the average performance across GQA, POPE and TextVQA.
  • Figure 5: Visualization of information propagation in G-Prune across different numbers of iterations $t$. The heatmaps show the score of each visual token on LLaVA-NeXT for TextVQA.
  • Figure 6: The comparative visualization of ToMe, FastV, and G-Prune on LLaVA-NeXT. G-Prune effectively retains tokens representative of regions with high information content, giving it a significant advantage in fine-grained tasks.
  • ...and 3 more figures