Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features

Jewon Lee, Ki-Ung Song, Seungmin Yang, Donguk Lim, Jaeyeon Kim, Wooksu Shin, Bo-Kyeong Kim, Yong Jae Lee, Tae-Ho Kim

TL;DR

The paper targets inefficiencies in cross-attention-based LVLMs caused by large image-token KV caches. It introduces Trimmed Llama, a training-free visual feature trimming method that exploits the sparsity observed in cross-attention maps. Using about 50% of the image features, the method performs comparably to the full-feature model while reducing latency and KV cache size. This offers a practical path to deploying high-resolution LVLMs more efficiently without additional training.

Abstract

Visual token reduction lowers inference costs caused by extensive image features in large vision-language models (LVLMs). Unlike related studies that prune tokens in self-attention-only LVLMs, our work uniquely addresses cross-attention-based models, which achieve superior performance. We identify that the key-value (KV) cache size for image tokens in cross-attention layers significantly exceeds that of text tokens in self-attention layers, posing a major compute bottleneck. To mitigate this issue, we exploit the sparse nature of cross-attention maps to selectively prune redundant visual features. Our Trimmed Llama effectively reduces KV cache demands without requiring additional training. With visual features reduced by 50%, our model can reduce inference latency and memory usage while achieving benchmark parity.
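The pruning idea in the abstract can be illustrated with a small sketch. It assumes a simple top-k rule on attention weights summed over heads and text queries; the paper's exact selection criterion may differ, and the names `aggregate_scores`, `select_image_tokens`, `trim_features`, and `keep_ratio` are hypothetical, introduced only for this example.

```python
import math


def aggregate_scores(attn):
    """Sum cross-attention weights over heads and text queries.

    attn[h][q][k] is the weight that head h assigns from text query q
    to image token k. Returns one score per image token.
    """
    num_image_tokens = len(attn[0][0])
    scores = [0.0] * num_image_tokens
    for head in attn:
        for query_row in head:
            for k, weight in enumerate(query_row):
                scores[k] += weight
    return scores


def select_image_tokens(attn, keep_ratio=0.5):
    """Return indices of the highest-scoring image tokens, in original order.

    Assumption: a plain top-k rule on the aggregated scores.
    """
    scores = aggregate_scores(attn)
    num_keep = max(1, math.ceil(keep_ratio * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda k: scores[k], reverse=True)
    return sorted(ranked[:num_keep])


def trim_features(features, kept_indices):
    """Keep only the selected image features; these would feed the keys and
    values of later cross-attention layers, shrinking their KV cache."""
    return [features[k] for k in kept_indices]


# Toy input: 2 heads, 2 text queries, 4 image tokens.
attn = [
    [[0.70, 0.10, 0.15, 0.05], [0.60, 0.05, 0.30, 0.05]],
    [[0.50, 0.10, 0.35, 0.05], [0.40, 0.20, 0.30, 0.10]],
]
features = ["f0", "f1", "f2", "f3"]

kept = select_image_tokens(attn, keep_ratio=0.5)
trimmed = trim_features(features, kept)
print(kept, trimmed)  # [0, 2] ['f0', 'f2']
```

Because selection happens once and the trimmed features are reused downstream, no retraining is involved in this sketch; only the inputs to the key and value projections shrink.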

Paper Structure

This paper contains 13 sections, 2 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Comparison of LVLM architectures. (a) Self-attention-only models process both image and text embeddings in all attention layers. (b) Cross-attention-based models use image features exclusively for KV operations in cross-attention layers, enabling efficient multimodal integration.
  • Figure 2: Proposed method. Image features are pruned in the first cross-attention block using a criterion derived from attention weights. The features serve as inputs for the keys and values in subsequent cross-attention layers, with the compressed keys and values stored in the KV cache (blue-shaded area).
  • Figure 3: KV cache memory. (a) As batch size increases, the KV cache volume from image features grows. (b) As the language token count grows, the KV cache size in cross-attention still dominates that of self-attention, up to a certain number of tokens.
  • Figure 4: Aggregated cross-attention weights. (a) The attention weights at the first cross-attention layer are summed over attention heads and text queries. (b) The attention weights for each cross-attention layer are summed over heads and visualized with the sequence length clipped to 400 for better visibility. Over different layers, specific image tokens consistently attract more attention from query tokens, indicating a structured sparse pattern.
  • Figure 5: Results under different compression ratios. Even with up to 50% reduction of visual features, our method retains the performance of the original model.
  • ...and 4 more figures
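
The KV cache comparison described in the Figure 3 caption can be made concrete with back-of-the-envelope arithmetic. The sketch below uses the standard KV cache size formula (keys plus values, per layer, per token, per KV head). All configuration numbers are illustrative placeholders and are not taken from the paper; `kv_cache_bytes` and `break_even_text_tokens` are hypothetical helper names.

```python
def kv_cache_bytes(num_layers, batch_size, num_tokens, num_kv_heads,
                   head_dim, bytes_per_elem=2):
    """Size of a KV cache in bytes: keys and values (factor 2) for every
    layer, sequence in the batch, token, KV head, and head dimension."""
    return (2 * num_layers * batch_size * num_tokens
            * num_kv_heads * head_dim * bytes_per_elem)


def break_even_text_tokens(cross_layers, image_tokens, self_layers):
    """Text length at which the self-attention cache matches the
    cross-attention cache, assuming both share the same head layout."""
    return cross_layers * image_tokens / self_layers


# Illustrative placeholder configuration (assumed, not from the paper).
CROSS_LAYERS, SELF_LAYERS = 8, 32
KV_HEADS, HEAD_DIM = 8, 128
IMAGE_TOKENS, TEXT_TOKENS = 6400, 512
MIB = 1024 ** 2

cross_full = kv_cache_bytes(CROSS_LAYERS, 1, IMAGE_TOKENS, KV_HEADS, HEAD_DIM)
cross_trimmed = kv_cache_bytes(CROSS_LAYERS, 1, IMAGE_TOKENS // 2, KV_HEADS, HEAD_DIM)
self_cache = kv_cache_bytes(SELF_LAYERS, 1, TEXT_TOKENS, KV_HEADS, HEAD_DIM)

print(cross_full / MIB)     # 200.0 (image tokens, cross-attention)
print(cross_trimmed / MIB)  # 100.0 (after keeping 50% of image features)
print(self_cache / MIB)     # 64.0  (text tokens, self-attention)
print(break_even_text_tokens(CROSS_LAYERS, IMAGE_TOKENS, SELF_LAYERS))  # 1600.0
```

Under these assumed numbers the image-token cache dominates until the text grows past the break-even length, which matches the qualitative behavior the caption describes. The cross-attention term also scales linearly with batch size, since each sequence carries its own image features.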