Cross-Self KV Cache Pruning for Efficient Vision-Language Inference

Xiaohuan Pei, Tao Huang, Chang Xu

TL;DR

This work tackles the memory and compute bottlenecks of long-context vision-language inference by proposing Cross-Self Pruning (CSP), a training-free KV-cache pruning method that prunes intra-modality and inter-modality tokens separately and uses an $n$-softmax to preserve the smoothness of the attention distribution. By decomposing attention into $A^{s}$ (intra-modality, or self) and $A^{c}$ (inter-modality, or cross) and applying independent top-$K$ selections, CSP reduces cache usage while maintaining or improving performance on diverse multimodal tasks, achieving up to a 41% improvement on challenging conversational embodied dialogue and up to a 13.6% cache-budget reduction on MileBench. The method is validated across multiple backbones (e.g., LLaVA-7b/13b, InternVL, MobileVLM) and outperforms prior KV-cache pruning approaches such as SnapKV, H2O, ReCo, and LOOK-M, demonstrating robustness to changes in budget and architecture. The results indicate that CSP is a practical, scalable solution for efficient long-context multimodal inference that requires no additional training and applies broadly.

Abstract

KV cache pruning has emerged as a promising technique for reducing memory and computation costs in long-context auto-regressive generation. Existing methods for vision-language models (VLMs) typically rely on self-attention scores from large language models (LLMs) to identify and prune irrelevant tokens. However, these approaches overlook the inherent distributional discrepancies between modalities, often leading to inaccurate token importance estimation and the over-pruning of critical visual tokens. To address this, we propose decomposing attention scores into intra-modality attention (within the same modality) and inter-modality attention (across modalities), enabling more precise KV cache pruning by independently managing these distinct attention types. Additionally, we introduce an $n$-softmax function to counteract distribution shifts caused by pruning, preserving the original smoothness of attention scores and ensuring stable performance. Our final training-free method, Cross-Self Pruning (CSP), achieves competitive performance compared to models with full KV caches while significantly outperforming previous pruning methods. Extensive evaluations on MileBench, a benchmark encompassing 29 multimodal datasets, demonstrate CSP's effectiveness, achieving up to a 41% performance improvement on challenging tasks like conversational embodied dialogue while reducing the KV cache budget by 13.6%. The code is available at https://github.com/TerryPei/CSP
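The decomposition described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function name `cross_self_select`, its parameters, and the way the budget is split by a hypothetical `cross_ratio` are assumptions made for illustration. The sketch takes already-normalized attention weights as input and does not reproduce the $n$-softmax, whose definition is not given in this summary.

```python
import numpy as np

def cross_self_select(attn, q_is_image, k_is_image, budget, cross_ratio=0.5):
    """Pick key indices to keep, scoring intra- and inter-modality attention separately.

    attn:        (num_queries, num_keys) attention weights.
    q_is_image:  boolean flag per query token (True = image, False = text).
    k_is_image:  boolean flag per key token.
    budget:      total number of top-K picks (hypothetical parameter).
    cross_ratio: share of the budget given to the inter-modality selection
                 (hypothetical parameter).
    """
    attn = np.asarray(attn, dtype=float)
    q_is_image = np.asarray(q_is_image, dtype=bool)
    k_is_image = np.asarray(k_is_image, dtype=bool)

    # same[i, j] is True when query i and key j belong to the same modality.
    same = q_is_image[:, None] == k_is_image[None, :]

    # Sum along the query axis within each region to get one score per key.
    self_score = np.where(same, attn, 0.0).sum(axis=0)    # intra-modality
    cross_score = np.where(~same, attn, 0.0).sum(axis=0)  # inter-modality

    # Split the budget and run two independent top-K selections.
    k_cross = int(round(budget * cross_ratio))
    k_self = budget - k_cross
    top_self = np.argsort(-self_score, kind="stable")[:k_self]
    top_cross = np.argsort(-cross_score, kind="stable")[:k_cross]

    # The union can be smaller than the budget if both selections overlap.
    return np.union1d(top_self, top_cross)
```

With a text query and an image query attending to four interleaved keys, the two selections can disagree with a single joint ranking: a joint top-2 over column sums would keep the keys with the largest total attention, while the split selection keeps the best key of each region.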

Paper Structure

This paper contains 23 sections, 7 equations, 7 figures, 5 tables, 1 algorithm.

Figures (7)

  • Figure 1: Distribution gap between self-attention and cross-attention during the decoding process in VLM tasks: (a) Kernel Density Estimation (KDE) of the attention weight distributions, and (b) Jensen-Shannon (JS) divergence scores between cross-attention and self-attention across all layers.
  • Figure 2: Illustration of the Cross-Self Pruning (CSP) KV cache process. The input sequence $\{ \text{\#image, \#text, \#image, \#text, ...} \}$ is projected onto query and key representations across multiple modalities. The $n$-softmax attention weights serve as the selection function, which is decomposed into intra-modality and cross-modality parts. Summation is performed along the query axis within each region, and the top-$k$ keys along the key dimension are selected as the tokens to retain in the pruned cache.
  • Figure 3: The impact of the cross-self ratio.
  • Figure 4: The benefit of $n$-softmax. Experiments are conducted on the ALFRED dataset with LLaVA-v1.5-7b.
  • Figure 5: The impact of the cache size budget.
  • ...and 2 more figures