Vision Token Reduction via Attention-Driven Self-Compression for Efficient Multimodal Large Language Models

Omer Faruk Deniz, Ruiyu Mao, Ruochen Li, Yapeng Tian, Latifur Khan

TL;DR

Vision tokens in Multimodal LLMs create a computational bottleneck because attention cost scales quadratically with token count. The authors propose Attention-Driven Self-Compression (ADSC), a parameter-free approach that progressively prunes vision tokens inside the LLM by applying a fixed downsampling ratio at selected layers. It relies on the LLM's native attention to reorganize information into the remaining tokens and the text stream, without computing importance scores or modifying attention. Training uses supervised instruction tuning with LoRA adapters and a reverse curriculum over pruning ratios, enabling robust compression with minimal forgetting. On LLaVA-1.5-7B, ADSC reduces FLOPs by 53.7% and peak KV-cache memory by 56.7% while preserving 98.2% of the original model's performance at a 66.7% vision-token budget, and it outperforms prior pruning baselines under high compression.

Abstract

Multimodal Large Language Models (MLLMs) incur significant computational cost from processing numerous vision tokens through all LLM layers. Prior pruning methods operate either before the LLM, limiting generality due to diverse encoder-projector designs, or within the LLM using heuristics that are incompatible with FlashAttention. We take a different approach: rather than identifying unimportant tokens, we treat the LLM itself as the optimal guide for compression. Observing that deeper layers naturally transmit vision-to-text information, we introduce Attention-Driven Self-Compression (ADSC), a simple, broadly applicable method that progressively reduces vision tokens using only the LLM's attention mechanism. Our method applies uniform token downsampling at selected layers, forming bottlenecks that encourage the model to reorganize and compress information into the remaining tokens. It requires no score computation, auxiliary modules, or attention modification, and remains fully compatible with FlashAttention. Applied to LLaVA-1.5, ADSC reduces FLOPs by 53.7% and peak KV-cache memory by 56.7%, while preserving 98.2% of the original model performance. Across multiple benchmarks, it outperforms prior pruning approaches in both efficiency and accuracy. Crucially, under high compression ratios, our method remains robust while heuristic-based techniques degrade sharply.
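
To make the "uniform token downsampling" step concrete, here is a minimal sketch of one possible reading of it: keep every `stride`-th vision token by position and leave all other tokens untouched. The function name `downsample_vision_tokens`, its parameters, and the choice of strided selection are assumptions made for illustration, not the authors' implementation. Tokens are represented as plain strings so the effect is easy to inspect.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def downsample_vision_tokens(
    tokens: Sequence[T], vision_start: int, vision_len: int, stride: int
) -> List[T]:
    """Keep every `stride`-th vision token; leave all other tokens untouched.

    Hypothetical helper: selection depends only on position, so no attention
    scores are read or modified.
    """
    if stride < 1:
        raise ValueError("stride must be >= 1")
    vision_end = vision_start + vision_len
    if vision_start < 0 or vision_end > len(tokens):
        raise ValueError("vision span out of range")
    prefix = list(tokens[:vision_start])                    # tokens before the image
    vision = list(tokens[vision_start:vision_end:stride])   # uniformly subsampled vision tokens
    suffix = list(tokens[vision_end:])                      # text tokens after the image
    return prefix + vision + suffix


# Toy sequence: 2 system tokens, 8 vision tokens, 3 text tokens.
sequence = ["s0", "s1"] + [f"v{i}" for i in range(8)] + ["t0", "t1", "t2"]
compressed = downsample_vision_tokens(sequence, vision_start=2, vision_len=8, stride=2)
print(compressed)
```

Because the kept indices are fixed by position rather than derived from attention weights, a step like this never needs the attention matrix to be materialized, which is consistent with the abstract's claim of compatibility with FlashAttention.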

Paper Structure

This paper contains 13 sections, 4 equations, 2 figures, and 4 tables.

Figures (2)

  • Figure 1: (a) Overview of our attention-driven self-compression framework. (b) Vision tokens are progressively downsampled at selected LLM layers ($\ell_1, \ell_2, \ldots, \ell_K$), forming information bottlenecks that prompt the model's attention mechanism to reorganize and compress visual information.
  • Figure 2: Performance vs. vision token count on six benchmarks (GQA, MME, POPE, ScienceQA, TextVQA, and VQAv2). Our Attention-Driven Self-Compression (ADSC) consistently outperforms ToMe, FastV, and PyramidDrop across all token budgets, and the performance gap widens as the number of retained vision tokens decreases. The dashed horizontal line denotes the full 576-token LLaVA-1.5-7B baseline (no compression).
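
The progressive schedule described in Figure 1(b) can be sketched as a simple per-layer token count. The example below starts from the 576 vision tokens of the baseline and downsamples at a few selected layers. The 32-layer depth, the layer indices `[8, 16, 24]`, the stride of 2, and the function name `vision_tokens_per_layer` are all illustrative assumptions; the paper's actual layers $\ell_1, \ldots, \ell_K$ and ratio are not given in this summary.

```python
import math
from typing import List


def vision_tokens_per_layer(
    initial_tokens: int, num_layers: int, prune_layers: List[int], stride: int
) -> List[int]:
    """Return the number of vision tokens seen by each layer (0-indexed).

    Hypothetical schedule: downsampling is applied to the input of every
    layer listed in `prune_layers`.
    """
    counts = []
    current = initial_tokens
    for layer in range(num_layers):
        if layer in prune_layers:
            # Keeping every `stride`-th token leaves ceil(n / stride) tokens.
            current = math.ceil(current / stride)
        counts.append(current)
    return counts


# Assumed setup: 576 vision tokens, 32 layers, stride-2 bottlenecks at three layers.
counts = vision_tokens_per_layer(576, 32, prune_layers=[8, 16, 24], stride=2)

# Average share of vision tokens processed per layer, relative to no compression.
average_fraction = sum(counts) / (len(counts) * 576)
print(counts[0], counts[8], counts[16], counts[24], average_fraction)
```

The averaged fraction is only a rough proxy for how much vision-token processing remains. It is not meant to reproduce the reported FLOPs or KV-cache figures, which depend on the actual layer choices, text length, and the cost of the non-attention parts of the model.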