ZSMerge: Zero-Shot KV Cache Compression for Memory-Efficient Long-Context LLMs
Xin Liu, Xudong Wang, Pei Liu, Guoming Tang
TL;DR
ZSMerge tackles the memory and compute bottlenecks of long-context LLMs by introducing a zero-shot KV cache compression framework that combines fine-grained head-level budget allocation, residual merging, and compensated attention scoring. The method achieves sublinear KV cache growth while preserving generation quality, delivering substantial memory savings (e.g., ~82% VRAM reduction at 54K tokens) and throughput gains (over threefold at extreme contexts) without retraining. Empirical results across multiple models (LLaMA2-7B, Falcon-7B, Mistral-7B-Instruct) and benchmarks (LongBench, InfiniteBench, XSum) show ZSMerge outperforms eviction-based baselines and generalizes across architectures and tasks, maintaining robust performance under tight cache budgets. The approach enables scalable long-context inference on resource-constrained devices, reduces energy consumption, and supports broad deployment without architectural changes or fine-tuning.
Abstract
The linear growth of key-value (KV) cache memory and the quadratic computational complexity of attention mechanisms pose significant bottlenecks for large language models (LLMs) in long-context processing. While existing KV cache optimization methods address these challenges through token pruning or feature merging, they often incur irreversible information loss or require costly parameter retraining. To address these limitations, we propose ZSMerge, a dynamic KV cache compression framework designed for efficient cache management, featuring three key operations: (1) fine-grained memory allocation guided by multi-dimensional token importance metrics at head-level granularity, (2) a residual merging mechanism that preserves critical context through compensated attention scoring, and (3) a zero-shot adaptation mechanism compatible with diverse LLM architectures without requiring retraining. ZSMerge significantly enhances memory efficiency and inference speed with negligible performance degradation across LLMs. When applied to LLaMA2-7B, it achieves a 20:1 KV cache compression ratio (reducing the cache memory footprint to 5% of the baseline) while sustaining comparable generation quality, together with a threefold throughput gain at extreme 54k-token contexts, where it also eliminates out-of-memory failures. The code is available at https://github.com/SusCom-Lab/ZSMerge.
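To give a sense of scale for the 20:1 ratio, the following is a rough back-of-the-envelope sketch, not taken from the paper. It assumes the publicly documented LLaMA2-7B configuration (32 layers, 32 attention heads, head dimension 128), fp16 storage, a batch size of 1, and "54k" read as 54,000 tokens. The function name `kv_cache_bytes` is hypothetical. It counts the KV cache only, so the result is not directly comparable to total VRAM figures, which also include model weights and activations.

```python
def kv_cache_bytes(num_tokens, num_layers=32, num_kv_heads=32,
                   head_dim=128, bytes_per_value=2):
    """Approximate KV cache size in bytes for a single sequence.

    Defaults are assumptions: LLaMA2-7B shape with fp16 (2-byte) values.
    The leading factor 2 accounts for storing both keys and values.
    """
    return 2 * num_layers * num_kv_heads * head_dim * bytes_per_value * num_tokens


context_len = 54_000            # assumed reading of "54k tokens"
retained = context_len // 20    # 20:1 compression keeps 5% of the entries

full = kv_cache_bytes(context_len)
compressed = kv_cache_bytes(retained)

gib = 2 ** 30
print(f"per-token cache: {kv_cache_bytes(1) / 2**20:.2f} MiB")  # 0.50 MiB
print(f"full cache:      {full / gib:.2f} GiB")                 # 26.37 GiB
print(f"compressed:      {compressed / gib:.2f} GiB")           # 1.32 GiB
```

Under these assumptions the uncompressed cache alone would exceed the memory of a typical 24 GiB GPU, which is consistent with the out-of-memory failures mentioned above, while a 5% budget brings it to roughly 1.3 GiB.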
