EchoKV: Efficient KV Cache Compression via Similarity-Based Reconstruction

Yixuan Wang; Shiyu Ji; Yijun Liu; Qingfu Zhu; Wanxiang Che

Abstract

The increasing memory demand of the Key-Value (KV) cache poses a significant bottleneck for Large Language Models (LLMs) in long-context applications. Existing low-rank compression methods often rely on irreversible parameter transformations, sacrificing the flexibility to switch back to full-precision inference when memory is abundant. In this paper, we propose EchoKV, a flexible KV cache compression scheme that enables on-demand transitions between standard and compressed inference. Unlike traditional compression-decompression paradigms, EchoKV utilizes a lightweight network to reconstruct the residual KV components from a partial subset, leveraging intrinsic inter-layer and intra-layer similarities among attention heads. We further introduce a two-stage fine-tuning strategy that allows for rapid, low-cost training (e.g., ~1 A100 GPU-hour for a 7B model). Experimental results on LongBench and RULER demonstrate that EchoKV consistently outperforms existing methods across various compression ratios while maintaining high throughput for short-context scenarios.
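The reconstruction idea in the abstract can be illustrated with a toy example. The sketch below is not the paper's network or training procedure: it replaces the lightweight network with a single linear map fit by least squares, uses synthetic keys built to be correlated across heads, and all sizes and names (`n_kept`, `shared`, `mix`) are hypothetical choices for illustration. It only shows why storing a subset of heads can suffice when the remaining heads are predictable from them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
n_tokens, n_heads, head_dim = 512, 8, 16
n_kept = 4  # heads whose keys stay in the cache; the rest are reconstructed

# Synthetic keys with cross-head correlation: every head is a noisy linear
# mix of a shared low-dimensional signal. This is a stand-in for the head
# similarity the abstract describes, not real model activations.
shared = rng.standard_normal((n_tokens, 32))
mix = rng.standard_normal((32, n_heads * head_dim))
noise = 0.01 * rng.standard_normal((n_tokens, n_heads * head_dim))
keys = (shared @ mix + noise).reshape(n_tokens, n_heads, head_dim)

kept = keys[:, :n_kept].reshape(n_tokens, -1)     # what the cache would store
dropped = keys[:, n_kept:].reshape(n_tokens, -1)  # what must be reconstructed

# Assumption: a single linear layer fit by least squares on the first half
# of the tokens stands in for the trained reconstruction network.
train, test = slice(0, 256), slice(256, 512)
W, *_ = np.linalg.lstsq(kept[train], dropped[train], rcond=None)

# Reconstruct the dropped heads for held-out tokens from the kept heads only.
recon = kept[test] @ W
rel_err = np.linalg.norm(recon - dropped[test]) / np.linalg.norm(dropped[test])
stored_fraction = n_kept / n_heads

print(f"stored fraction: {stored_fraction:.2f}, relative error: {rel_err:.4f}")
```

Because the full keys are never transformed, only partially stored, a scheme of this shape can fall back to the standard cache simply by keeping all heads, which is the flexibility the abstract emphasizes. On real activations the achievable error depends on how similar the heads actually are.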

Paper Structure (41 sections, 10 equations, 5 figures, 7 tables)


Introduction
Related Work
KV Cache Compression
Quantization.
Dimensionality reduction.
KV Cache Eviction
Methodology
Overview of EchoKV
Network Architecture
Global Cache Inputs.
Local Cache Inputs.
Training Details
Reconstruction loss.
Attention loss.
Experiments
...and 26 more sections

Figures (5)

Figure 1: Illustration of the differences between existing low-rank sharing approaches and EchoKV. Unlike the compression-decompression paradigm, EchoKV employs a lightweight network to reconstruct the residual KV components of specific attention heads from others.
Figure 2: Schematic illustration of the training and inference workflows for EchoKV compared to the standard KV cache. The figure presents a schematic illustration for a single token, where distinct cache blocks correspond to different attention heads.
Figure 3: Analysis experiments on EchoKV. All evaluations are conducted using Llama-3.1-8B-Instruct (Grattafiori et al., 2024) on the LongBench (Bai et al., 2024) benchmark.
Figure 4: Visualization of NIAH results on Llama-3.1-8B-Instruct with a compression ratio of 0.3.
Figure 5: Visualization of NIAH results on Mistral-7B-Instruct-v0.3 with a compression ratio of 0.3.
