PureKV: Plug-and-Play KV Cache Optimization with Spatial-Temporal Sparse Attention for Vision-Language Large Models

Zhonghua Jiang; Kunxi Li; Yiyun Zhou; Sihao Liu; Zhaode Wang; Chengfei Lv; Shengyu Zhang

TL;DR

PureKV tackles the memory and latency bottlenecks of Vision-Language LLMs by enabling plug-and-play KV-cache compression that remains compatible with efficient attention backends. It introduces Cross-Layer Importance Estimation (CLIE), which uses lower-layer attention scores together with high-layer V-vector norms to rank high-layer KV entries without computing high-layer attention, and Spatial-Temporal Sparse Attention (ST-SpAttn) to purify video KV caches by suppressing spatial noise and temporal redundancy. Statistical validation via Spearman correlations supports the cross-layer estimation approach, and extensive experiments on VideoLLaMA2 and Qwen2.5-VL-7B show up to 5x KV-cache compression and 3.16x prefill acceleration with minimal quality loss. These contributions enable scalable, real-time deployment of high-resolution VLLMs by reducing memory footprint and latency while preserving downstream accuracy.
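The cross-layer idea can be illustrated with a small sketch. It assumes that the attention a token receives in a lower layer is averaged over queries and multiplied by the L2 norm of that token's V vector in the high layer; the summary above does not state how the two signals are combined, so the product is an assumption, and the names `estimate_importance`, `select_kv`, and `keep_ratio` are hypothetical.

```python
import math

def estimate_importance(lower_attn, high_values):
    """Score each cached token for a high layer without that layer's attention.

    lower_attn:  attention weights from a lower layer, one row per query and
                 one column per cached token.
    high_values: the high layer's V vectors, one list of floats per token.
    Combining the two signals by a product is an assumption of this sketch.
    """
    num_tokens = len(high_values)
    # Average attention each token receives across lower-layer queries.
    received = [sum(row[t] for row in lower_attn) / len(lower_attn)
                for t in range(num_tokens)]
    # L2 norm of each token's V vector in the high layer.
    norms = [math.sqrt(sum(x * x for x in v)) for v in high_values]
    return [a * n for a, n in zip(received, norms)]

def select_kv(scores, keep_ratio):
    """Return the indices of the tokens to keep, in original order."""
    budget = max(1, int(len(scores) * keep_ratio))
    ranked = sorted(range(len(scores)), key=lambda t: scores[t], reverse=True)
    return sorted(ranked[:budget])

# Toy input: 2 queries over 5 cached tokens, 2-dimensional V vectors.
lower_attn = [
    [0.70, 0.10, 0.10, 0.05, 0.05],
    [0.40, 0.10, 0.30, 0.10, 0.10],
]
high_values = [[1.0, 0.0], [0.1, 0.1], [2.0, 1.0], [0.2, 0.0], [3.0, 4.0]]

scores = estimate_importance(lower_attn, high_values)
kept = select_kv(scores, keep_ratio=0.4)
print(kept)
```

Because the ranking needs only a lower layer's attention weights and the high layer's V vectors, the high layer itself could still run through a fused kernel that never materializes its attention matrix, which is the compatibility the summary emphasizes.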

Abstract

Vision-Language Large Models (VLLMs) face significant efficiency challenges when processing high-resolution inputs. The quadratic complexity of attention and autoregressive generation, together with the ever-growing key-value (KV) cache, severely hinders both the prefilling and decoding stages. Recent efforts compress the KV cache by identifying and pruning the entries of less important tokens, but these methods typically rely on attention scores to estimate token importance, which makes them incompatible with efficient attention mechanisms such as FlashAttention and Sparse Attention that do not explicitly compute attention matrices. Moreover, existing methods overlook how sparse attention, while accelerating the prefilling stage, alters the information structure of the KV cache and thereby compromises the effectiveness of downstream KV cache compression strategies. To address these issues, we propose PureKV, a plug-and-play framework for joint optimization of sparse attention and KV cache compression. We first introduce a KV cache compression strategy that is fully compatible with efficient attention accelerators. Our method uses lower-layer attention scores to estimate the importance of the higher layers' KV cache, enabling active pruning without compromising accuracy. In addition, we design a Spatial-Temporal Sparse Attention (ST-SpAttn) module specifically tailored to video KV cache compression algorithms. This module combines spatial and temporal attention sparsity to improve the compression efficiency of KV cache optimization algorithms by purifying spatial noise and temporal redundancy in the KV cache. At the same time, ST-SpAttn also accelerates the prefilling stage of VLLMs. Extensive experiments on VLLMs (VideoLLaMA2, Qwen2.5-VL) show that PureKV achieves 5.0 times KV cache compression and 3.16 times prefill acceleration with negligible quality degradation.
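To make the idea of combining spatial and temporal sparsity concrete, the sketch below builds one possible sparse attention mask over video tokens. The abstract does not specify the actual ST-SpAttn pattern, so the rule used here is purely an assumption for illustration: a token attends causally to tokens in its own frame (spatial) and to the token at the same position in the preceding `window` frames (temporal). The function name `st_sparse_mask` and its parameters are hypothetical.

```python
def st_sparse_mask(num_frames, tokens_per_frame, window=1):
    """Build a boolean attention mask over video tokens (True = may attend).

    Tokens are laid out frame by frame. Under the assumed pattern, query q
    may attend to key k if k is not in the future and either
      - spatial:  k lies in the same frame as q, or
      - temporal: k sits at the same position in one of the `window`
                  preceding frames.
    """
    n = num_frames * tokens_per_frame
    mask = [[False] * n for _ in range(n)]
    for q in range(n):
        q_frame, q_pos = divmod(q, tokens_per_frame)
        for k in range(q + 1):  # causal: keys up to and including q
            k_frame, k_pos = divmod(k, tokens_per_frame)
            same_frame = k_frame == q_frame
            same_pos_recent = k_pos == q_pos and 0 < q_frame - k_frame <= window
            mask[q][k] = same_frame or same_pos_recent
    return mask

mask = st_sparse_mask(num_frames=3, tokens_per_frame=2)
kept_pairs = sum(map(sum, mask))                 # attended (query, key) pairs
causal_pairs = len(mask) * (len(mask) + 1) // 2  # pairs under dense causal attention
print(kept_pairs, causal_pairs)
```

Under this assumed pattern, the fraction of pairs kept relative to dense causal attention shrinks as the number of frames grows, which is the kind of saving that would speed up prefilling.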
