Vision Token Reduction via Attention-Driven Self-Compression for Efficient Multimodal Large Language Models
Omer Faruk Deniz, Ruiyu Mao, Ruochen Li, Yapeng Tian, Latifur Khan
TL;DR
Vision tokens create a computational bottleneck in multimodal LLMs because self-attention cost scales quadratically with token count. The authors propose Attention-Driven Self-Compression (ADSC), a parameter-free approach that progressively prunes vision tokens inside the LLM by applying a fixed downsampling ratio at selected layers. It computes no importance scores and leaves the attention operation unmodified, relying instead on the LLM's native attention to reorganize information into the remaining tokens and the text stream. Training uses supervised instruction tuning with LoRA adapters and a reverse curriculum over pruning ratios, enabling robust compression with minimal forgetting. On LLaVA-1.5-7B, ADSC reduces FLOPs by 53.7% and peak KV-cache memory by 56.7% while preserving 98.2% of the original accuracy at a 66.7% vision-token budget, and it outperforms prior pruning baselines under high compression.
Abstract
Multimodal Large Language Models (MLLMs) incur significant computational cost from processing numerous vision tokens through all LLM layers. Prior pruning methods operate either before the LLM, which limits generality because encoder-projector designs vary, or within the LLM, using heuristics that are incompatible with FlashAttention. We take a different approach: rather than identifying unimportant tokens, we treat the LLM itself as the optimal guide for compression. Observing that deeper layers naturally transmit vision-to-text information, we introduce Attention-Driven Self-Compression (ADSC), a simple, broadly applicable method that progressively reduces vision tokens using only the LLM's attention mechanism. Our method applies uniform token downsampling at selected layers, forming bottlenecks that encourage the model to reorganize and compress information into the remaining tokens. It requires no score computation, auxiliary modules, or attention modification, and remains fully compatible with FlashAttention. Applied to LLaVA-1.5, ADSC reduces FLOPs by 53.7% and peak KV-cache memory by 56.7% while preserving 98.2% of the original model's performance. Across multiple benchmarks, it outperforms prior pruning approaches in both efficiency and accuracy. Crucially, under high compression ratios, our method remains robust, whereas heuristic-based techniques degrade sharply.
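The uniform downsampling step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names `keep_indices` and `vision_tokens_per_layer`, the stride-based selection rule, and the layer schedule in the usage note are all hypothetical choices made for illustration. The sketch assumes that "uniform" means keeping every `stride`-th vision token in sequence order while leaving text tokens untouched, and it only computes which positions survive, with no attention scores involved.

```python
from typing import Dict, List, Sequence


def keep_indices(is_vision: Sequence[bool], stride: int) -> List[int]:
    """Return the sequence positions that survive one downsampling step.

    Assumption (not from the paper): every text position is kept, and
    vision positions are kept at a fixed stride, starting with the first.
    """
    if stride < 1:
        raise ValueError("stride must be >= 1")
    kept: List[int] = []
    vision_seen = 0  # number of vision tokens encountered so far
    for position, vision in enumerate(is_vision):
        if not vision:
            kept.append(position)  # text tokens are never dropped
        else:
            if vision_seen % stride == 0:
                kept.append(position)  # keep every stride-th vision token
            vision_seen += 1
    return kept


def vision_tokens_per_layer(
    num_vision: int, num_layers: int, schedule: Dict[int, int]
) -> List[int]:
    """Vision-token count processed by each layer.

    `schedule` maps a layer index to the stride applied to that layer's
    input. The schedule itself is a hypothetical illustration.
    """
    counts: List[int] = []
    remaining = num_vision
    for layer in range(num_layers):
        if layer in schedule:
            # Keeping positions 0, s, 2s, ... out of n leaves ceil(n / s).
            remaining = -(-remaining // schedule[layer])
        counts.append(remaining)
    return counts
```

As a usage example, `keep_indices([False, True, True, True, True, True, False], 2)` keeps both text positions and the first, third, and fifth vision positions. With the 576 vision tokens that LLaVA-1.5 feeds a 32-layer LLM and a hypothetical schedule `{8: 2, 16: 2, 24: 2}`, `vision_tokens_per_layer(576, 32, {8: 2, 16: 2, 24: 2})` gives 576, 288, 144, and 72 vision tokens across the four successive eight-layer stages. In a real model the returned indices would be used to gather rows of the hidden states (and the matching position information) before the selected layer runs; because the attention kernel itself is unchanged, this kind of selection stays compatible with fused kernels such as FlashAttention.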
