PureKV: Plug-and-Play KV Cache Optimization with Spatial-Temporal Sparse Attention for Vision-Language Large Models

Zhonghua Jiang, Kunxi Li, Yiyun Zhou, Sihao Liu, Zhaode Wang, Chengfei Lv, Shengyu Zhang

TL;DR

PureKV tackles the memory and latency bottlenecks of Vision-Language LLMs by enabling plug-and-play KV-cache compression that remains compatible with efficient attention backends. It introduces Cross-Layer Importance Estimation (CLIE), which uses lower-layer attention scores together with high-layer V-vector norms to rank high-layer KV entries without computing high-layer attention, and Spatial-Temporal Sparse Attention (ST-SpAttn) to purify video KV caches by suppressing spatial noise and temporal redundancy. Statistical validation via Spearman correlations supports the cross-layer estimation approach, and extensive experiments on VideoLLaMA2 and Qwen2.5-VL-7B show up to 5x KV-cache compression and 3.16x prefill acceleration with minimal quality loss. These contributions enable scalable, real-time deployment of high-resolution VLLMs by reducing memory footprint and latency while preserving downstream accuracy.
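
The cross-layer idea can be illustrated with a small sketch. It is not the authors' implementation: the summary only says that lower-layer attention scores are used "together with" high-layer V-vector norms, so the multiplicative combination below, the column-sum aggregation over queries, and the names `clie_scores` and `select_kv` are all assumptions made for illustration.

```python
import math

def clie_scores(lower_attn, high_values):
    """Assumed scoring rule: attention mass a token receives in a lower
    layer, multiplied by the L2 norm of its high-layer V vector."""
    num_tokens = len(high_values)
    # Column sums: total attention each cached token receives across queries.
    attn_mass = [sum(row[j] for row in lower_attn) for j in range(num_tokens)]
    v_norms = [math.sqrt(sum(x * x for x in v)) for v in high_values]
    return [a * n for a, n in zip(attn_mass, v_norms)]

def select_kv(scores, keep):
    """Indices of the `keep` highest-scoring tokens, in original order."""
    ranked = sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:keep])

# Toy data: 3 queries attending over 4 cached tokens (rows sum to 1).
lower_attn = [
    [0.7, 0.1, 0.1, 0.1],
    [0.6, 0.2, 0.1, 0.1],
    [0.5, 0.1, 0.3, 0.1],
]
# Hypothetical high-layer V vectors for the same 4 tokens.
high_values = [[3.0, 4.0], [0.1, 0.0], [1.0, 0.0], [0.0, 6.0]]

scores = clie_scores(lower_attn, high_values)
kept = select_kv(scores, keep=2)
print(kept)
```

In this toy case the last token receives the least attention but has a large V norm, so it is retained ahead of a token with more attention and a small V norm. No high-layer attention matrix is needed for the ranking, which is what would keep such a scheme compatible with kernels that never materialize one.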

Abstract

Vision-Language Large Models (VLLMs) face significant efficiency challenges when processing high-resolution inputs. The quadratic complexity of attention, autoregressive generation, and the steadily growing key-value (KV) cache severely hinder the prefilling and decoding stages. Recent efforts compress the KV cache by identifying and pruning the entries of less important tokens, but these methods typically rely on attention scores to estimate token importance, which makes them incompatible with efficient attention mechanisms such as FlashAttention and Sparse Attention that do not explicitly compute attention matrices. Moreover, existing methods overlook how sparse attention, while accelerating the prefilling stage, alters the information structure of the KV cache and thereby compromises the effectiveness of downstream KV cache compression strategies. To address these issues, we propose PureKV, a plug-and-play framework for joint optimization of sparse attention and KV cache compression. We first introduce a KV cache compression strategy that is fully compatible with efficient attention accelerators. Our method uses lower-layer attention scores to estimate the importance of the high-layer KV cache, enabling active pruning without compromising accuracy. In addition, we design a Spatial-Temporal Sparse Attention (ST-SpAttn) module tailored to video KV cache compression. This module combines spatial and temporal attention sparsity to improve the compression efficiency of KV cache optimization algorithms by purifying spatial noise and temporal redundancy in the KV cache. ST-SpAttn also accelerates the prefilling stage of VLLMs. Extensive experiments on VLLMs (VideoLLaMA2, Qwen2.5-VL) show that PureKV achieves 5.0 times KV cache compression and 3.16 times prefill acceleration with negligible quality degradation.
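
To give a sense of scale for the growing KV cache and the 5.0 times compression figure, here is a back-of-the-envelope calculation. The model dimensions and sequence length are hypothetical values chosen only for illustration, and the sketch assumes compression keeps a uniform one in five entries in every layer, which the abstract does not specify.

```python
def kv_cache_bytes(seq_len, num_layers, num_kv_heads, head_dim, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence.
    The leading factor 2 accounts for storing both K and V."""
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * bytes_per_elem

# Hypothetical configuration (not taken from the paper), fp16 storage.
cfg = dict(num_layers=28, num_kv_heads=4, head_dim=128)
seq_len = 32_000  # e.g. a long video prompt; illustrative only

full = kv_cache_bytes(seq_len=seq_len, **cfg)
compressed = kv_cache_bytes(seq_len=seq_len // 5, **cfg)  # keep 1 in 5 entries

print(f"full:       {full / 2**20:.0f} MiB")
print(f"compressed: {compressed / 2**20:.0f} MiB")
```

The cache grows linearly with sequence length, so pruning entries reduces memory by the same factor as the retention ratio under these assumptions.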

Paper Structure

This paper contains 17 sections, 7 equations, 6 figures, 7 tables.

Figures (6)

  • Figure 1: (a) Traditional attention-score-based KV cache compression computes attention weights at every layer to evaluate token importance, which is incompatible with efficient attention mechanisms such as FlashAttention and Sparse Attention. (b) PureKV uses lower-layer attention scores to identify critical KV cache entries in high layers and remains compatible with efficient attention mechanisms in those layers, accelerating the prefilling stage. (c) Dense attention gradually mixes important and unimportant information in high layers. (d) ST-SpAttn generates cleaner and more structured KV, reducing noise while preserving key spatiotemporal dependencies.
  • Figure 2: Overview of our PureKV method. PureKV is a plug-and-play framework for KV cache optimization that is compatible with efficient attention mechanisms. It introduces a lightweight importance estimator that uses lower-layer attention scores and the L2 norm of high-layer V vectors to estimate KV cache importance, avoiding explicit computation of high-layer attention. With Spatial-Temporal Sparse Attention, PureKV suppresses background noise and irrelevant visual interference and eliminates redundancy across consecutive frames. The resulting purified KV cache significantly improves the accuracy and robustness of subsequent KV cache compression strategies.
  • Figure 3: With attention weights held fixed, the magnitude of the V vector also significantly affects the output of the attention mechanism.
  • Figure 4: Cross-Layer Importance Estimation correlation analysis. The experiment shows that the high-layer KV cache importance estimated from lower-layer attention scores is significantly positively correlated with the true high-layer KV cache importance. (VideoLLaMA2 uses grouped-query attention, dividing the heads into 4 groups, with each group sharing a KV cache.)
  • Figure 5: CLIE Layer Index: the lower-layer index used to estimate the importance of the high-layer KV cache. ST-SpAttn Layer Index: the layer index at which ST-SpAttn is activated. Since ST-SpAttn does not explicitly compute attention scores, the ST-SpAttn Layer Index is greater than the CLIE Layer Index.
  • ...and 1 more figure
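
The kind of rank-correlation check described for Figure 4 can be sketched as follows. This is a generic Spearman computation (Pearson correlation of ranks, with ties sharing their mean rank), not the authors' evaluation code, and the two score lists are made-up numbers standing in for estimated and reference importance values.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks

def spearman(x, y):
    """Spearman's rho: Pearson correlation of the rank-transformed inputs."""
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(x)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    sx = sum((a - mx) ** 2 for a in rx) ** 0.5
    sy = sum((b - my) ** 2 for b in ry) ** 0.5
    return cov / (sx * sy)

estimated = [9.0, 0.04, 0.5, 1.8, 0.3]  # made-up cross-layer estimates
reference = [7.5, 0.10, 0.9, 0.6, 0.2]  # made-up high-layer importance
rho = spearman(estimated, reference)
print(round(rho, 3))
```

A rho close to 1 means the two scores order the tokens almost identically, which is the property that matters when the estimate is only used to decide which entries to keep.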