QG-VTC: Question-Guided Visual Token Compression in MLLMs for Efficient VQA
Shuai Li, Jian Xu, Xiao-Hui Li, Chao Deng, Lin-Lin Huang
TL;DR
The paper tackles the high memory and computation cost of open-world VQA with Multimodal LLMs by introducing QG-VTC, a question-guided visual token compression method that operates inside the vision encoder in a hierarchical, multi-step fashion. It embeds the user question into the vision space, computes its correlation with the visual tokens, retains the most relevant tokens, and softly recycles the rest via attention-based averaging, progressively reducing the token count from $N$ to a much smaller $M$ (e.g., from $N=576$ to $M=72$). The compression module uses a Q/K/V formulation with a query-based correlation to select tokens, and compression is applied at deeper ViT layers to preserve semantic information. Experimental results across multiple VQA benchmarks show that QG-VTC matches or surpasses uncompressed models while using as little as 1/8 of the original visual tokens and significantly reducing overall compute.
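As a rough illustration of a single compression step of the kind summarized above, the sketch below scores each visual token against a question vector, keeps the top-scoring tokens, and folds the remaining tokens back into the kept ones through attention weights. It is not the authors' implementation: the name `compress_step`, the scaled dot-product scoring, and the specific recycling rule (each kept token averaged with an attention-weighted sum of the dropped tokens) are assumptions made for illustration, and the learned Q/K/V projections are omitted.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def compress_step(tokens, question, keep):
    """Reduce N visual tokens to `keep` tokens, guided by a question vector.

    tokens:   list of N vectors of dimension d (visual tokens)
    question: one vector of dimension d (question already embedded in vision space)
    keep:     number of tokens to retain, keep <= N
    """
    d = len(question)
    # Correlation between the question and every visual token (assumed: scaled dot product).
    scores = [dot(question, t) / math.sqrt(d) for t in tokens]
    # Rank tokens by score, then restore the original order within each group.
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    kept = sorted(order[:keep])
    dropped = sorted(order[keep:])
    out = [list(tokens[i]) for i in kept]
    if not dropped:
        return out
    # Soft recycling (assumed form): each kept token attends over the dropped
    # tokens and is averaged with the attention-weighted sum of them.
    for row, i in enumerate(kept):
        logits = [dot(tokens[i], tokens[j]) / math.sqrt(d) for j in dropped]
        w = softmax(logits)
        recycled = [sum(w[k] * tokens[j][c] for k, j in enumerate(dropped))
                    for c in range(d)]
        out[row] = [(a + b) / 2 for a, b in zip(out[row], recycled)]
    return out


# Toy usage: four 2-D tokens compressed to two.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
question = [1.0, 0.0]
compressed = compress_step(tokens, question, keep=2)
```

In this toy setting the output has the same feature dimension as the input, so a step like this could sit between two encoder layers without changing the layer interface.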
Abstract
Recent advances in Multimodal Large Language Models (MLLMs) have shown significant progress in open-world Visual Question Answering (VQA). However, integrating visual information increases the number of processed tokens, leading to higher GPU memory usage and computational overhead. Images often contain more redundant information than text, and not all visual details are pertinent to a specific question. To address these challenges, we propose QG-VTC, a novel question-guided visual token compression method for MLLM-based VQA tasks. QG-VTC employs a pretrained text encoder and a learnable feed-forward layer to embed user questions into the vision encoder's feature space, then computes correlation scores between the question embeddings and the visual tokens. By selecting the most relevant tokens and softly compressing the others, QG-VTC keeps the retained visual information closely matched to the user's question. Additionally, a progressive strategy applies this compression across different vision encoder layers, gradually reducing the number of tokens. This approach maximizes retention of question-relevant information while discarding irrelevant details. Experimental results show that our method achieves performance on par with uncompressed models using just 1/8 of the visual tokens. The code and model will be publicly available on GitHub.
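To make the arithmetic of the progressive strategy concrete, the sketch below computes one possible schedule of token counts that reaches 1/8 of the original tokens. The geometric (constant-ratio) schedule, the choice of three steps, and the helper name `token_schedule` are assumptions for illustration only; the text above does not specify how many compression steps are used or how many tokens each step keeps.

```python
def token_schedule(n_start, n_end, steps):
    """Token counts after each of `steps` compression steps.

    Assumes a constant reduction ratio per step (a hypothetical choice),
    and forces the final count to equal n_end exactly.
    """
    ratio = (n_end / n_start) ** (1.0 / steps)
    counts = [round(n_start * ratio ** s) for s in range(1, steps)]
    return counts + [n_end]


# 576 visual tokens reduced to 72 (1/8) in three assumed steps.
print(token_schedule(576, 72, 3))  # [288, 144, 72]
```

Under this assumed schedule each step halves the token count, so three steps give the overall factor of $576 / 72 = 8$.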
