PAR: Prompt-Aware Token Reduction Method for Efficient Large Multimodal Models

Yingen Liu, Fan Wu, Ruihui Li, Zhuo Tang, Kenli Li

TL;DR

This work tackles the computational burden of multimodal large language models (MLLMs) by introducing PAR, a training-free, prompt-aware token reduction method. PAR separates visual token redundancy into two kinds: external redundancy, addressed by prompt-guided semantic retrieval, and internal redundancy, addressed by a token router, so that only task-relevant tokens are retained. The approach combines text prompts, graph-based semantic clustering, and a routing mechanism, and reaches a token reduction ratio about 2x that of prior approaches with minimal loss in accuracy. Across VQA tasks it reports up to 83% FLOPs reduction at an 89% compression ratio while preserving roughly 97% of baseline performance, and it also improves hallucination resistance. The results indicate that token reduction can substantially accelerate multimodal reasoning without architectural changes, making MLLMs more practical to deploy in resource-constrained settings.

Abstract

Multimodal large language models (MLLMs) demonstrate strong performance across visual tasks, but their efficiency is hindered by the significant computational and memory demands of processing long contexts in multimodal inputs. To address this, we introduce PAR (Prompt-Aware Token Reduction), a novel plug-and-play approach that reduces visual tokens efficiently without compromising model performance. Unlike previous methods, which rely heavily on attention mechanisms and overlook cross-modal interactions, we use a prompt-aware strategy to adaptively identify and cluster essential visual tokens. PAR categorizes visual context redundancy into two types: external and internal. External redundancy is minimized through semantic retrieval, while internal redundancy is addressed using a token routing mechanism. This method substantially reduces computational load without requiring additional training or complex architectural modifications. Experimental results demonstrate that, across various visual question answering tasks, PAR reduces FLOPs by 83% with a compression ratio of 89%, while retaining 97% of baseline accuracy. The adaptive design of PAR achieves a 2x token reduction ratio compared to prior approaches, enabling a better balance between performance and efficiency.
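
To make the idea of prompt-guided retrieval concrete, the following is a minimal sketch of selecting visual tokens by their similarity to prompt tokens. It is not the authors' implementation: the cosine-similarity scoring, the max-over-prompt-tokens aggregation, the `keep_ratio` parameter, and the function names are all assumptions chosen for illustration, and the token router that handles internal redundancy is not modeled here.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)

def retrieve_visual_tokens(visual_tokens, prompt_tokens, keep_ratio):
    """Return sorted indices of the visual tokens most similar to the prompt.

    Assumption: each visual token is scored by its best cosine match against
    any prompt token, and the top `keep_ratio` fraction is kept (at least one).
    `prompt_tokens` must be non-empty.
    """
    scores = [max(cosine(v, p) for p in prompt_tokens) for v in visual_tokens]
    k = max(1, round(keep_ratio * len(visual_tokens)))
    ranked = sorted(range(len(visual_tokens)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])  # restore original token order

# Toy 2-D embeddings (hypothetical values): tokens 0 and 1 align with the prompt.
visual = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
prompt = [[1.0, 0.0]]
kept = retrieve_visual_tokens(visual, prompt, keep_ratio=0.5)
print(kept)
```

Scoring against the prompt, rather than against global attention statistics, is what makes the selection depend on the question being asked: a different prompt embedding would keep a different subset of the same image tokens.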

Paper Structure

This paper contains 20 sections, 6 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Previous works that rely on the attention mechanism focus on global visual tokens and cause unnecessary redundancy. In contrast, our method is guided by prompts and focuses more effectively on the task-relevant visual tokens. Our approach achieves a token reduction ratio about 2x that of previous methods.
  • Figure 2: The framework of our method. Given an image and text input, PAR processes each modality separately: the text is structured using predefined templates, and the image undergoes semantic clustering. Prompt tokens are then used to retrieve the relevant visual tokens, reducing external redundancy. Finally, the token router refines these selections, removing internal redundancy before passing them to the language model (LM) for answer generation.
  • Figure 3: Visualization of PAR. From left to right, the retrieval ratio changes and the visual tokens become increasingly sparse. The rightmost image shows the final result of PAR.
  • Figure 4: Accuracy results on four datasets (TextVQA, POPE, MMBench, GQA) using only direct retrieval with token ratios of 50%, 40%, 30%, and 20%.
  • Figure 5: Hyperparameter ablation results for Hybrid Retrieval Ratio, Token Router Threshold, and Semantic Cluster Rate across three datasets. To illustrate the trade-off between performance and efficiency, we use Token Ratio as the x-axis and Accuracy as the y-axis. The red marker represents the selected parameters.
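
As a rough guide to the token ratios quoted for Figure 4, the short calculation below converts each ratio into a visual token count. Two assumptions are made that the page does not confirm: that "token ratio" means the fraction of visual tokens kept, and that the image is encoded into 576 visual tokens (a hypothetical input size used only for illustration).

```python
def tokens_kept(num_visual_tokens, token_ratio):
    """Number of visual tokens remaining at a given keep ratio (at least one)."""
    return max(1, round(num_visual_tokens * token_ratio))

NUM_VISUAL_TOKENS = 576  # assumption: hypothetical per-image visual token count

# Token ratios listed in the Figure 4 caption.
budget = {ratio: tokens_kept(NUM_VISUAL_TOKENS, ratio) for ratio in (0.5, 0.4, 0.3, 0.2)}
for ratio, kept in budget.items():
    print(f"token ratio {ratio:.0%}: {kept} of {NUM_VISUAL_TOKENS} visual tokens kept")
```

Under these assumptions, moving from a 50% to a 20% token ratio cuts the visual context from 288 to 115 tokens, which is the range over which Figure 4 tracks accuracy.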