SparK: Query-Aware Unstructured Sparsity with Recoverable KV Cache Channel Pruning

Huanxuan Liao, Yixing Xu, Shizhu He, Guanchen Li, Xuanwu Yin, Dong Li, Emad Barsoum, Jun Zhao, Kang Liu

TL;DR

SparK tackles the KV cache memory bottleneck in long-context LLM inference by introducing unstructured, query-aware channel pruning with a lightweight on-the-fly recovery mechanism. By selecting the top salient channels per token and reconstructing pruned entries during attention, SparK sustains attention fidelity under aggressive pruning and remains compatible with other KV compression techniques. Empirical results across LongBench and RULER show substantial memory savings (over 30% KV storage reduction) with minimal accuracy loss (often under 5%), and robustness across models like LLaMA-3 and Qwen-3. The method is training-free and plug-and-play, enabling longer contexts within fixed memory budgets and broad applicability as a drop-in KV-cache optimization.
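To make the memory bottleneck concrete, the sketch below estimates KV cache size for one sequence and shows how a 30% storage reduction translates into extra context within a fixed budget. The model dimensions (32 layers, 8 KV heads, head dimension 128, 16-bit values) are assumptions typical of an 8B LLaMA-3-style configuration, not figures taken from the paper. The 30% saving is applied uniformly per token for illustration and ignores any mask or index overhead.

```python
def kv_cache_bytes(seq_len, n_layers=32, n_kv_heads=8, head_dim=128, bytes_per_elem=2):
    """Bytes needed to cache keys and values for one sequence (assumed dimensions)."""
    # The leading 2 accounts for storing both the key and the value tensor.
    return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem


def max_tokens_in_budget(budget_bytes, reduction=0.0, **dims):
    """Longest sequence that fits in budget_bytes if storage shrinks by `reduction`."""
    per_token = kv_cache_bytes(1, **dims) * (1.0 - reduction)
    return int(budget_bytes // per_token)


full = kv_cache_bytes(128 * 1024)                 # full cache at 128k tokens
print(f"full cache: {full / 2**30:.1f} GiB")
print("tokens in same budget, no reduction :", max_tokens_in_budget(full))
print("tokens in same budget, 30% reduction:", max_tokens_in_budget(full, reduction=0.30))
```

Because the cache grows linearly with sequence length, any fractional saving per token converts directly into a proportional increase in the context that fits in the same memory.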

Abstract

Long-context inference in large language models (LLMs) is increasingly constrained by the KV cache bottleneck: memory usage grows linearly with sequence length, while attention computation scales quadratically. Existing approaches address this issue by compressing the KV cache along the temporal axis, using strategies such as token eviction or merging to reduce memory and computational overhead. However, these methods often neglect fine-grained variations in importance across feature dimensions (i.e., the channel axis), which limits their ability to balance efficiency and model accuracy. In practice, we observe that channel saliency varies dramatically across both queries and positions: certain feature channels carry near-zero information for a given query, while others spike in relevance. To address this oversight, we propose SparK, a training-free, plug-and-play method that applies unstructured sparsity by pruning the KV cache at the channel level and dynamically restoring the pruned entries during attention score computation. Notably, our approach is orthogonal to existing KV compression and quantization techniques, so it can be combined with them for further acceleration. By reducing channel-level redundancy, SparK enables processing of longer sequences within the same memory budget. For sequences of equal length, SparK not only preserves or improves model accuracy but also reduces KV cache storage by over 30% compared to eviction-based methods. Furthermore, even at an aggressive pruning ratio of 80%, SparK loses less than 5% in performance relative to the baseline eviction method, demonstrating its robustness and effectiveness. Our code will be available at https://github.com/Xnhyacinth/SparK.
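The prune-then-recover idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the saliency score (the magnitude of each channel's contribution to the query-key dot product), the single-query setup, and the recovery rule (filling pruned channels with the per-token mean of the removed values) are all assumptions chosen for simplicity. The function names `prune_key_channels`, `recover_keys`, and `attention_scores` are hypothetical.

```python
import numpy as np


def prune_key_channels(keys, query, keep):
    """Keep the `keep` most salient channels of each key token (unstructured mask).

    keys: (T, d) cached keys; query: (d,) query vector used to score channels.
    Saliency is |q_i * k_{t,i}|, an assumed stand-in for a query-aware score.
    """
    saliency = np.abs(keys * query)                        # (T, d)
    # Indices of the `keep` largest saliency values in each row.
    top = np.argpartition(-saliency, keep - 1, axis=1)[:, :keep]
    mask = np.zeros(keys.shape, dtype=bool)
    np.put_along_axis(mask, top, True, axis=1)
    pruned = np.where(mask, keys, 0.0)
    # Cache one statistic per token for recovery: the mean of the pruned
    # entries (assumption; a richer cached distribution could be used instead).
    n_pruned = max(keys.shape[1] - keep, 1)
    fill = np.where(mask, 0.0, keys).sum(axis=1, keepdims=True) / n_pruned
    return pruned, mask, fill


def recover_keys(pruned, mask, fill):
    """Rebuild dense keys: kept channels are exact, pruned ones use the cached fill."""
    return np.where(mask, pruned, fill)


def attention_scores(query, keys):
    """Scaled dot-product scores of one query against all keys."""
    return keys @ query / np.sqrt(keys.shape[1])


rng = np.random.default_rng(0)
keys = rng.normal(size=(6, 16))
query = rng.normal(size=16)
pruned, mask, fill = prune_key_channels(keys, query, keep=4)   # 75% of channels pruned
recovered = recover_keys(pruned, mask, fill)
print("score error, zero fill:", np.abs(attention_scores(query, keys) - attention_scores(query, pruned)).max())
print("score error, recovered:", np.abs(attention_scores(query, keys) - attention_scores(query, recovered)).max())
```

The mask differs from token to token, which is what makes the sparsity unstructured. A structured scheme would drop the same channel indices for every token.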

Paper Structure

This paper contains 27 sections, 15 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: Illustrative comparisons among (a) full KV cache, (b) eviction-based KV compression, (c) structured channel pruning-based KV reduction, and (d) our proposed SparK, which employs unstructured channel pruning with subsequent recovery during attention score computation.
  • Figure 2: Rethinking the salience of key channels using LLaMA3.1-8B-Instruct on LongBench. All visualizations are derived from the 18th attention layer and the 0th attention head.
  • Figure 3: An illustration of SparK. SparK computes channel-wise saliency scores and applies unstructured pruning during prefill. During decoding, SparK leverages $\boldsymbol{\mathcal{F}}$ and sampling from the cached distribution to reconstruct the pruned channels, then performs standard full attention.
  • Figure 4: Performance–Efficiency analysis of SparK on LLaMA3-8B-Instruct. (a) LongBench average performance under varying pruning ratios ($\lambda$). SparK significantly outperforms ThinK across all compression levels. (b) Throughput (tokens/s) with increasing input length. SparK maintains stable decoding speed across long sequences (up to 128k). (c) Cache size vs. performance trade-off. SparK achieves a favorable efficiency–performance balance compared to ThinK and SnapKV.
  • Figure 5: Visualization of QK-score distributions across channel indices for 6 representative tokens. Brighter hues indicate higher attention contributions, revealing: (1) Position-dependent sparsity (e.g., Token 0 vs 1195), (2) Task-critical channel clustering, (3) High variance in salient channel indices.
  • ...and 2 more figures