Efficient LLaMA-3.2-Vision by Trimming Cross-attended Visual Features
Jewon Lee, Ki-Ung Song, Seungmin Yang, Donguk Lim, Jaeyeon Kim, Wooksu Shin, Bo-Kyeong Kim, Yong Jae Lee, Tae-Ho Kim
TL;DR
The paper targets inefficiencies in cross-attention-based LVLMs caused by large image-token KV caches. It introduces Trimmed Llama, a training-free visual feature trimming method that exploits observed cross-attention sparsity. The method achieves comparable performance to full-feature models using about 50% of image features, with notable latency and KV-cache reductions. This work offers a practical path to deploying high-resolution LVLMs with improved efficiency without additional training.
Abstract
Visual token reduction lowers inference costs caused by extensive image features in large vision-language models (LVLMs). Unlike related studies that prune tokens in self-attention-only LVLMs, our work uniquely addresses cross-attention-based models, which achieve superior performance. We identify that the key-value (KV) cache size for image tokens in cross-attention layers significantly exceeds that of text tokens in self-attention layers, posing a major compute bottleneck. To mitigate this issue, we exploit the sparse nature of cross-attention maps to selectively prune redundant visual features. Our Trimmed Llama effectively reduces KV cache demands without requiring additional training. With visual features reduced by 50%, our model lowers inference latency and memory usage while maintaining benchmark parity.
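The pruning idea described above can be illustrated with a minimal sketch. It is not the authors' exact selection criterion, which the abstract does not specify. It assumes that the importance of an image token can be approximated by the total cross-attention it receives from the text queries in a single head, and that a fixed fraction of tokens is kept. The names `trim_visual_features`, `gather`, and `keep_ratio` are hypothetical and chosen only for illustration.

```python
import math


def softmax(row):
    """Numerically stable softmax over one row of attention logits."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def trim_visual_features(scores, keep_ratio=0.5):
    """Pick the image tokens to keep (assumed criterion, not the paper's).

    scores: cross-attention logits, shape [num_text_queries][num_image_tokens],
            taken from a single head of a single cross-attention layer.
    Returns the sorted indices of the image tokens that are kept.
    """
    attn = [softmax(row) for row in scores]
    num_img = len(scores[0])
    # Assumption: importance = attention mass summed over all text queries.
    importance = [sum(row[j] for row in attn) for j in range(num_img)]
    k = max(1, math.ceil(keep_ratio * num_img))
    ranked = sorted(range(num_img), key=lambda j: importance[j], reverse=True)
    return sorted(ranked[:k])


def gather(features, keep):
    """Keep only the selected entries, e.g. cached keys or values."""
    return [features[j] for j in keep]


# Toy input: 2 text queries attending to 4 image tokens.
scores = [[4.0, 0.1, 0.2, 3.0],
          [3.5, 0.0, 0.3, 2.5]]
keys = ["k0", "k1", "k2", "k3"]  # stand-ins for cached key vectors

keep = trim_visual_features(scores, keep_ratio=0.5)
trimmed_keys = gather(keys, keep)
print(keep, trimmed_keys)
```

In this toy case the attention concentrates on image tokens 0 and 3, so only those two cache entries survive and the image-token cache shrinks to half its original length. In a real model the same index set would be applied to the keys and values of the cross-attention layers that reuse the trimmed features.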
