What Kind of Visual Tokens Do We Need? Training-free Visual Token Pruning for Multi-modal Large Language Models from the Perspective of Graph
Yutao Jiang, Qiong Wu, Wenhao Lin, Wei Yu, Yiyi Zhou
TL;DR
This work addresses the high computational cost that multimodal LLMs incur from long visual token sequences. It introduces G-Prune, a training-free, graph-based visual token pruning method that treats visual tokens as nodes in a cosine-similarity graph and uses iterative information propagation to identify representative tokens in both foreground and background regions. Applied to LLaVA-NeXT, the approach reduces FLOPs by up to approximately 63% in the reported cases while largely retaining accuracy across eight VL benchmarks (for example, drops of 0.95% on VQA2.0 and 2.34% on TextVQA). It outperforms several existing pruning methods, and ablations support its robustness. The findings point toward efficient high-resolution multimodal reasoning in MLLMs, with lower compute and little loss of accuracy on tasks such as TextVQA and other OCR-rich evaluations.
Abstract
Recent Multimodal Large Language Models (MLLMs) often use a large number of visual tokens to compensate for their visual shortcomings, leading to excessive computation and obvious visual redundancy. In this paper, we investigate what kind of visual tokens are needed for MLLMs, and reveal that both foreground and background tokens are critical for MLLMs given the varying difficulties of examples. Based on this observation, we propose a graph-based method for training-free visual token pruning, termed G-Prune. In particular, G-Prune regards visual tokens as nodes and constructs their connections based on their semantic similarities. Afterwards, information is propagated via the weighted links, and the most important tokens after the iterations are kept for the MLLM; these can come from either the foreground or the background. To validate G-Prune, we apply it to a recent MLLM called LLaVA-NeXT and conduct extensive experiments on a set of benchmarks. The experimental results show that G-Prune can greatly reduce computation overhead while retaining high performance on both coarse- and fine-grained tasks. For instance, G-Prune can reduce the FLOPs of LLaVA-NeXT by 63.57% on VQA2.0 and TextVQA with accuracy drops of only 0.95% and 2.34%, respectively.
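
To make the pipeline described in the abstract concrete, the following is a minimal NumPy sketch of graph-based token pruning: build a cosine-similarity graph over the visual tokens, propagate scores along the weighted edges for a few iterations, and keep the highest-scoring tokens. It is an illustration under stated assumptions, not the authors' implementation. The function name `prune_tokens_by_graph`, the parameters `keep`, `num_iters`, and `sim_threshold`, the uniform score initialization, and the PageRank-style column normalization are all hypothetical choices; the abstract does not specify these details.

```python
import numpy as np


def prune_tokens_by_graph(tokens, keep, num_iters=3, sim_threshold=0.0):
    """Return the sorted indices of the `keep` highest-scoring tokens.

    tokens: array of shape (N, D), one row per visual token.
    All hyperparameters here are illustrative assumptions.
    """
    tokens = np.asarray(tokens, dtype=np.float64)
    n = tokens.shape[0]
    if not 0 < keep <= n:
        raise ValueError("keep must be in the range 1..N")

    # Nodes are tokens; edge weights are cosine similarities.
    norms = np.linalg.norm(tokens, axis=1, keepdims=True)
    unit = tokens / np.clip(norms, 1e-12, None)
    sim = unit @ unit.T

    # Assumption: keep only edges above a threshold and drop self-loops.
    adj = np.where(sim > sim_threshold, sim, 0.0)
    np.fill_diagonal(adj, 0.0)

    # Assumption: each node splits its score among its neighbours in
    # proportion to edge weight (column-normalized, PageRank-style).
    col_sums = adj.sum(axis=0, keepdims=True)
    transition = adj / np.clip(col_sums, 1e-12, None)

    # Start from a uniform score and propagate along the weighted links.
    score = np.full(n, 1.0 / n)
    for _ in range(num_iters):
        score = transition @ score

    # Keep the top-scoring tokens, restoring their original order so the
    # pruned sequence preserves the spatial ordering of the input.
    top = np.argsort(-score, kind="stable")[:keep]
    return np.sort(top)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    visual_tokens = rng.normal(size=(576, 64))  # hypothetical token grid
    kept = prune_tokens_by_graph(visual_tokens, keep=144)
    pruned = visual_tokens[kept]
    print(pruned.shape)
```

Because the sketch ranks tokens only by propagated score, nothing in it distinguishes foreground from background explicitly; which regions survive depends entirely on the graph structure. Note that with this particular normalization a token with no neighbours above the threshold receives no score, which may differ from how the paper handles isolated tokens.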
