QG-VTC: Question-Guided Visual Token Compression in MLLMs for Efficient VQA

Shuai Li, Jian Xu, Xiao-Hui Li, Chao Deng, Lin-Lin Huang

TL;DR

The paper tackles the high memory and computation cost of open-world VQA with Multimodal LLMs by introducing QG-VTC, a question-guided visual token compression method that operates inside the vision encoder in a hierarchical, multi-step fashion. It embeds the user question into the vision space, computes correlation with visual tokens, retains the most relevant tokens, and softly recycles the rest via attention-based averaging, progressively reducing token counts from $N$ to a much smaller $M$ (e.g., from $N=576$ to $M=72$). The compression module leverages a Q/K/V formulation and a query-based correlation to select tokens, with deeper ViT layers chosen for compression to preserve semantic information. Experimental results across multiple VQA benchmarks show that QG-VTC matches or surpasses uncompressed models using as little as 1/8 of the original visual tokens and reduces overall compute significantly, highlighting substantial practical gains for efficient multimodal reasoning.

Abstract

Recent advances in Multi-modal Large Language Models (MLLMs) have shown significant progress in open-world Visual Question Answering (VQA). However, integrating visual information increases the number of processed tokens, leading to higher GPU memory usage and computational overhead. Images often contain more redundant information than text, and not all visual details are pertinent to a specific question. To address these challenges, we propose QG-VTC, a novel question-guided visual token compression method for MLLM-based VQA tasks. QG-VTC employs a pretrained text encoder and a learnable feed-forward layer to embed user questions into the vision encoder's feature space, then computes correlation scores between the question embeddings and visual tokens. By selecting the most relevant tokens and softly compressing the others, QG-VTC keeps the retained visual information closely aligned with the user's question. Additionally, a progressive strategy applies this compression across different vision encoder layers, gradually reducing the number of tokens. This approach maximizes retention of question-relevant information while discarding irrelevant details. Experimental results show that our method achieves performance on par with uncompressed models using just 1/8 of the visual tokens. The code and model will be publicly available on GitHub.
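
The select-and-recycle step described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function name `compress_step`, the dot-product correlation, the attention of kept tokens over dropped tokens, and the 50/50 merge are all hypothetical choices made here for concreteness, and the question embedding is assumed to be already projected into the vision feature space.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def compress_step(visual, question, keep):
    """One question-guided compression step (illustrative sketch only).

    visual:   (N, d) visual tokens from some vision-encoder layer
    question: (d,) question embedding, assumed already in the vision space
    keep:     number of tokens to retain after this step
    Returns the (keep, d) compressed tokens and the indices that were kept.
    """
    d = visual.shape[1]
    # Correlation between the question and each visual token (assumed: scaled dot product).
    scores = visual @ question / np.sqrt(d)
    order = np.argsort(-scores, kind="stable")
    kept_idx = np.sort(order[:keep])   # sort to preserve the original token order
    drop_idx = np.sort(order[keep:])
    kept, dropped = visual[kept_idx], visual[drop_idx]
    if dropped.shape[0] == 0:
        return kept, kept_idx
    # Soft recycling (assumed form): each kept token attends over the dropped
    # tokens and absorbs their attention-weighted average.
    attn = softmax(kept @ dropped.T / np.sqrt(d), axis=-1)  # (keep, N - keep)
    recycled = attn @ dropped                               # (keep, d)
    merged = 0.5 * (kept + recycled)                        # assumed equal-weight merge
    return merged, kept_idx


# Progressive use: apply the step several times with a shrinking budget.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(576, 64))
question = rng.normal(size=64)
for budget in (492, 408, 324, 240, 156, 72):
    tokens, _ = compress_step(tokens, question, budget)
print(tokens.shape)  # (72, 64)
```

In the method as described, each step would sit at a different vision-encoder layer, so the tokens would pass through further transformer blocks between steps; the loop above omits those blocks and only shows how the token count shrinks.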

Paper Structure

This paper contains 21 sections, 6 equations, 9 figures, 5 tables.

Figures (9)

  • Figure 1: The overall architecture of QG-VTC. The compression module within the vision encoder is capable of compressing visual information under the guidance of a user's question. The projector is responsible for projecting the compressed visual information into the semantic space of the LLM. Subsequently, the vision tokens and text tokens are concatenated into a sequence, which is then input into the LLM to obtain the answer.
  • Figure 2: Calculation details of the compression module.
  • Figure 3: The relationship between computational load and performance (evaluated on VQA$^\text{T}$). 100% denotes the performance and computational load of the baseline (576 visual tokens).
  • Figure 4: The visualization results of QG-VTC. The red box represents the area corresponding to the answer. The unmasked areas indicate the retained visual tokens.
  • Figure 5: Question: What is the brand of this camera? Ground truth: dakota digital. Output: dakota digital (✓). The values 492, 408, 324, 240, 156, and 72 are the respective numbers of retained visual tokens. The red box marks the visual tokens corresponding to the answer.
  • ...and 4 more figures
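
The token counts listed in the Figure 5 caption are consistent with removing the same number of tokens (84) at each of six steps, starting from 576 and ending at 72, which is 1/8 of 576. The short sketch below reproduces that sequence; `linear_schedule` is a hypothetical helper, and the even spacing is an inference from the caption rather than something the text above states as the paper's schedule.

```python
def linear_schedule(n_start, n_end, steps):
    """Token budgets after each of `steps` equal-sized reductions (hypothetical helper)."""
    if steps <= 0:
        raise ValueError("steps must be positive")
    total = n_start - n_end
    if total < 0 or total % steps != 0:
        raise ValueError("reduction must be non-negative and divisible by steps")
    per_step = total // steps
    return [n_start - per_step * i for i in range(1, steps + 1)]


print(linear_schedule(576, 72, 6))  # [492, 408, 324, 240, 156, 72]
```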