ASAP: Attention-Shift-Aware Pruning for Efficient LVLM Inference

Surendra Pathak, Bo Han

Abstract

While Large Vision-Language Models (LVLMs) demonstrate exceptional multi-modal capabilities, the quadratic computational cost of processing high-resolution visual tokens remains a critical bottleneck. Though recent token reduction strategies attempt to accelerate inference, such methods inadequately exploit attention values and fail to address token redundancy. More critically, they overlook the "attention shift" phenomenon inherent in LVLMs, which skews token attention scores. In this work, we propose ASAP, a novel training-free, KV-Cache-compatible pruning recipe that comprehensively addresses these limitations. First, we mitigate the attention shift by utilizing a dynamic bidirectional soft attention mask, ensuring the selection of genuinely informative tokens rather than naive attention-based selection. Second, we posit that high semantic redundancy within the token set degrades performance. We therefore introduce a weighted soft merging component that merges semantically similar tokens, preserving only the most feature-dense visual patches for subsequent layers. ASAP achieves virtually lossless compression of visual context, retaining 99.02% of the original LLaVA-NeXT-7B performance while aggressively slashing computational FLOPs by ~80%.

Paper Structure

This paper contains 26 sections, 8 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: ASAP integration within the LVLM architecture. The plug-and-play pruning module is embedded directly within the language model backbone, operating between standard decoder layers to progressively compress the visual token sequence during the inference forward pass.
  • Figure 2: Qualitative comparison of token selection. Retained tokens (top 128) are visible, while dropped tokens are darkened. FastV demonstrates a bottom spatial bias, missing the distant third car. ASAP successfully mitigates this bias, preserving critical background features to accurately count all cars.
  • Figure 3: The two-stage ASAP architecture. After decoder layer $L$, a salience-guided bidirectional matrix retains top-$k$ salient visual tokens. A feature-based similarity module then consolidates redundant tokens via salience-weighted merging and a salvage mechanism. This dual-filtering pipeline passes a compact, feature-dense visual set to layer $L+1$, strictly preserving all text and system prompts.
  • Figure 4: Ablation of ASAP modules on TextVQA and MME. FastV prunes naively using Top-k selection. Important uses Salience-Guided Bidirectional attention values. ASAP adds Salience-Weighted Consolidation and Budget Reallocation. The progressive gains confirm that combining attention-driven selection with feature consolidation maximizes semantic density.
  • Figure 5: Impact of integrating SG-BiMask in existing pruning frameworks. Our SG-BiMask component consistently improves performance when integrated with the standard causal attention baseline (FastV) across all evaluated benchmarks.
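
The two-stage flow described in the Figure 3 caption (top-$k$ selection by salience, then salience-weighted merging of similar tokens) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the cosine-similarity criterion, the greedy grouping, and the `threshold` parameter are all assumptions made for illustration. The salience scores are taken as given, so the bidirectional soft attention mask that produces them is not modeled, and the salvage and budget reallocation steps are omitted.

```python
import numpy as np


def select_top_k(salience: np.ndarray, k: int) -> np.ndarray:
    """Return indices of the k most salient tokens, in original positional order."""
    top = np.argsort(-salience, kind="stable")[:k]
    return np.sort(top)


def merge_similar(features: np.ndarray, salience: np.ndarray, threshold: float):
    """Greedily merge tokens whose cosine similarity reaches `threshold`.

    Assumption: salience values are strictly positive. Each group is replaced by
    the salience-weighted average of its members; the group salience is the sum.
    Output tokens are ordered by descending salience of the group's seed token.
    """
    norms = np.linalg.norm(features, axis=1, keepdims=True).clip(min=1e-12)
    normed = features / norms
    sim = normed @ normed.T  # pairwise cosine similarity

    order = np.argsort(-salience, kind="stable")  # visit most salient tokens first
    assigned = np.zeros(len(features), dtype=bool)
    merged_feats, merged_sal = [], []
    for i in order:
        if assigned[i]:
            continue
        mask = (sim[i] >= threshold) & ~assigned
        mask[i] = True  # the seed token always belongs to its own group
        group = np.flatnonzero(mask)
        assigned[group] = True
        weights = salience[group] / salience[group].sum()
        merged_feats.append(weights @ features[group])
        merged_sal.append(salience[group].sum())
    return np.stack(merged_feats), np.array(merged_sal)


def prune_visual_tokens(features, salience, k, threshold):
    """Stage 1: keep the top-k salient tokens. Stage 2: merge near-duplicates."""
    keep = select_top_k(salience, k)
    return merge_similar(features[keep], salience[keep], threshold)


if __name__ == "__main__":
    # Toy example: tokens 0 and 1 are near-duplicates, token 3 has the lowest salience.
    feats = np.array([[1.0, 0.0], [0.99, 0.01], [0.0, 1.0], [0.5, 0.5]])
    sal = np.array([0.4, 0.3, 0.2, 0.1])
    out_feats, out_sal = prune_visual_tokens(feats, sal, k=3, threshold=0.95)
    print(out_feats.shape, out_sal)
```

In this toy run, token 3 is dropped by the top-$k$ step and tokens 0 and 1 are merged into one, so four visual tokens become two. Only the visual tokens are passed to such a routine; per the caption, text and system prompt tokens are left untouched.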