Attention-aware Inference Optimizations for Large Vision-Language Models with Memory-efficient Decoding

Fatih Ilhan, Gaowen Liu, Ramana Rao Kompella, Selim Furkan Tekin, Tiansheng Huang, Zachary Yahn, Yichang Xu, Ling Liu

Abstract

Large Vision-Language Models (VLMs) have achieved remarkable success in multi-modal reasoning, but their inference-time efficiency remains a significant challenge because of the memory overhead during decoding, especially when queries and answers consist of long sequences of visual and text tokens. This paper presents AttentionPack, an adaptive, attention-aware optimization framework that improves the memory efficiency of large vision-language models during decoding. It targets the challenges caused by the growing number of visual inputs and interactions, particularly in long-context tasks with multiple high-resolution images or videos. AttentionPack is novel in two respects: (i) we introduce a multi-head attention compaction method that stores key and value matrices economically by exploiting their implicit low-rank structure, and (ii) we develop a token-specific, attention-aware decompression mechanism to reduce latency overhead. Experimental results on multiple benchmarks demonstrate that AttentionPack improves memory efficiency by up to 8x. This enables either larger batch sizes and faster batch inference while preserving model output quality, or longer context lengths for superior retrieval performance. We also report the effectiveness of AttentionPack combined with eviction, quantization, and kernel fusion, showing further efficiency gains for resource-limited environments.
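
The compaction idea in aspect (i) can be illustrated with a minimal sketch. It assumes a truncated SVD over a cache whose heads are concatenated along the feature axis; the abstract does not specify the factorization, so the function names, shapes, rank, and synthetic data below are illustrative assumptions rather than the authors' implementation.

```python
import numpy as np


def explained_variance_ratio(matrix, rank):
    """Fraction of squared singular-value energy kept by the top `rank` components."""
    s = np.linalg.svd(matrix, compute_uv=False)
    energy = s ** 2
    return float(energy[:rank].sum() / energy.sum())


def compress_combined_heads(cache, rank):
    """Truncated-SVD compression of a (heads, tokens, head_dim) cache.

    Assumption: heads are concatenated along the feature axis so that a single
    low-rank factorization is shared by all heads.
    """
    heads, tokens, head_dim = cache.shape
    combined = cache.transpose(1, 0, 2).reshape(tokens, heads * head_dim)
    u, s, vt = np.linalg.svd(combined, full_matrices=False)
    coeffs = u[:, :rank] * s[:rank]  # per-token coefficients, (tokens, rank)
    basis = vt[:rank]                # shared basis, (rank, heads * head_dim)
    return coeffs, basis


def decompress(coeffs, basis, heads, head_dim):
    """Rebuild the (heads, tokens, head_dim) cache from its low-rank factors."""
    tokens = coeffs.shape[0]
    combined = coeffs @ basis
    return combined.reshape(tokens, heads, head_dim).transpose(1, 0, 2)


# Synthetic cache with a planted low-rank structure (hypothetical data).
rng = np.random.default_rng(0)
heads, tokens, head_dim, true_rank = 8, 256, 16, 8
latent = rng.standard_normal((tokens, true_rank))
mixing = rng.standard_normal((true_rank, heads * head_dim))
noise = 0.01 * rng.standard_normal((tokens, heads * head_dim))
combined = latent @ mixing + noise
cache = combined.reshape(tokens, heads, head_dim).transpose(1, 0, 2)

coeffs, basis = compress_combined_heads(cache, rank=16)
restored = decompress(coeffs, basis, heads, head_dim)

rel_err = np.linalg.norm(restored - cache) / np.linalg.norm(cache)
memory_ratio = cache.size / (coeffs.size + basis.size)
evr = explained_variance_ratio(combined, true_rank)
print(f"relative error: {rel_err:.4f}, memory ratio: {memory_ratio:.2f}x")
```

In this toy setting the stored factors are several times smaller than the full cache because the planted rank is far below the combined feature width. Whether real key and value caches behave this way is the empirical question the paper addresses.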

Paper Structure

This paper contains 21 sections, 1 equation, 10 figures, 6 tables, 1 algorithm.

Figures (10)

  • Figure 1: The schematic of the workflow during inference. After the prefill phase, at each decoding step, we first compress the cache along combined heads. We perform attention-aware partial decompression before attention score computation.
  • Figure 2: Rank vs. explained variance ratio for key and value vectors, without and with combining along the head axis before compression.
  • Figure 3: Visualization of compression and partial decompression.
  • Figure 4: Impact of attention-aware decompression. Each line represents the results when AttentionPack is applied to the key cache (k), the value cache (v), or both (kv). Each line has four dots; the size of each dot represents the ratio of visual tokens ($r_1 \in \{0.125, 0.25, 0.375, 0.5\}$) decompressed with the full rank.
  • Figure 5: Total decode latency for 100 queries with LLaVA1.5-7B.
  • ...and 5 more figures
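
The partial decompression described in the captions for Figures 1, 3, and 4 can be sketched as follows. The captions state only that a ratio $r_1$ of visual tokens is decompressed with the full rank. The selection rule used here (top tokens by an importance score such as attention mass), the reduced rank for the remaining tokens, and all names are assumptions made for illustration.

```python
import numpy as np


def partial_decompress(coeffs, basis, importance, full_ratio, low_rank):
    """Reconstruct rows of `coeffs @ basis`, using full rank only for important tokens.

    coeffs:     (tokens, rank) per-token coefficients
    basis:      (rank, dim) shared basis
    importance: (tokens,) score per token, e.g. attention mass (assumed)
    full_ratio: fraction of tokens rebuilt with every component (plays the role of r_1)
    low_rank:   number of leading components used for the remaining tokens
    """
    tokens = coeffs.shape[0]
    n_full = int(round(full_ratio * tokens))
    order = np.argsort(-importance)       # most important tokens first
    full_idx = order[:n_full]
    out = coeffs[:, :low_rank] @ basis[:low_rank]  # cheap reconstruction for all rows
    out[full_idx] = coeffs[full_idx] @ basis       # full reconstruction for selected rows
    return out, full_idx


# Hypothetical factors: decaying component scales and an orthonormal basis.
rng = np.random.default_rng(0)
tokens, rank, dim = 64, 16, 32
coeffs = rng.standard_normal((tokens, rank)) * np.linspace(1.0, 0.05, rank)
q, _ = np.linalg.qr(rng.standard_normal((dim, rank)))
basis = q.T
importance = rng.random(tokens)

full = coeffs @ basis
approx, full_idx = partial_decompress(coeffs, basis, importance, full_ratio=0.25, low_rank=4)

err = np.linalg.norm(approx - full, axis=1)
others = np.setdiff1d(np.arange(tokens), full_idx)
print(f"mean error, full-rank tokens: {err[full_idx].mean():.2e}")
print(f"mean error, low-rank tokens:  {err[others].mean():.2e}")
```

The design trade-off in this sketch is that the matrix product for low-priority tokens uses only `low_rank` components, so reconstruction cost falls while the tokens with the highest scores keep their exact stored values.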