KVSharer: Efficient Inference via Layer-Wise Dissimilar KV Cache Sharing

Yifei Yang, Zouying Cao, Qiguang Chen, Libo Qin, Dongjie Yang, Hai Zhao, Zhi Chen

TL;DR

KVSharer introduces a training-free, layer-wise KV cache sharing method that exploits dissimilar KV caches across Transformer layers to reduce memory during LLM inference. By ranking layer pairs by dissimilarity and validating replacements via output similarity, it identifies a sharing strategy that substantially lowers KV cache memory (around 30%) while preserving most of the model's performance and delivering generation speedups (at least 1.3x). The approach is plug-and-play and compatible with existing intra-layer KV cache compression methods, enabling further memory savings when combined. Across multiple models and benchmarks, KVSharer demonstrates robust generalizability, reasonable strategy search time, and practical benefits for long sequences and real-time generation.

Abstract

The development of large language models (LLMs) has significantly expanded model sizes, resulting in substantial GPU memory requirements during inference. The key and value storage of the attention map in the KV (key-value) cache accounts for more than 80% of this memory consumption. Most existing KV cache compression methods focus on intra-layer compression within a single Transformer layer, while few works consider layer-wise compression. In this paper, we propose a plug-and-play method called KVSharer, which shares the KV cache between layers to achieve layer-wise compression. Rather than intuitively sharing based on higher similarity, we discover a counterintuitive phenomenon: sharing dissimilar KV caches better preserves the model performance. Experiments show that KVSharer can reduce KV cache computation by 30%, thereby lowering memory consumption without significantly impacting model performance, and it can also achieve at least 1.3 times generation acceleration. Additionally, we verify that KVSharer is compatible with existing intra-layer KV cache compression methods, and combining both can further save memory.
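
To make the layer-wise saving concrete, the following is a back-of-the-envelope sketch of KV cache size with and without cross-layer sharing. The model dimensions (32 layers, 32 KV heads, head dimension 128, 2-byte elements, 4096 tokens), the choice of 10 shared layers, and the function name `kv_cache_bytes` are illustrative assumptions, not values taken from the paper. The sketch also assumes that a layer reusing another layer's cache stores nothing of its own.

```python
def kv_cache_bytes(stored_layers, kv_heads, head_dim, seq_len, batch=1, bytes_per_elem=2):
    """Bytes needed to hold keys and values for `stored_layers` layers."""
    # The factor 2 covers one key tensor and one value tensor per layer.
    return 2 * stored_layers * kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical configuration (an assumption for illustration, not from the paper).
NUM_LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN = 32, 32, 128, 4096
SHARED_LAYERS = 10  # layers that reuse another layer's cache instead of storing one

full = kv_cache_bytes(NUM_LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN)
shared = kv_cache_bytes(NUM_LAYERS - SHARED_LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN)
saving = 1 - shared / full  # fraction of cache memory no longer stored

print(f"full cache:   {full / 2**30:.2f} GiB")
print(f"with sharing: {shared / 2**30:.2f} GiB")
print(f"saving:       {saving:.1%}")
```

Under these assumptions the saving equals the fraction of layers that share a cache (10/32 = 31.25% here), which is why a layer-wise method can be stacked on top of intra-layer compression that shrinks each stored layer.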

Paper Structure

This paper contains 31 sections, 7 figures, 9 tables, 1 algorithm.

Figures (7)

  • Figure 1: Previous methods primarily focus on discarding Keys and Values within layers. In contrast, we share KV caches across layers based on their dissimilarity.
  • Figure 2: An illustration of the strategy search process of KVSharer. For a given LLM, (a) inference is performed on the calibration dataset and the Euclidean distance between the flattened KV cache vectors of every pair of layers is computed, with the pairs sorted in descending order of distance. (b) KV cache pairs are then replaced sequentially, while ensuring that the final hidden-state similarity with the original model exceeds the threshold $\mathcal{T}$, until the KV cache compression ratio reaches $\mathcal{R}$.
  • Figure 3: During both the prefill and generation phases of inference, KVSharer follows the sharing strategy found by the search and directly copies the KV cache of a previously computed layer to the current layer during the forward computation.
  • Figure 4: The search time required by KVSharer for different models. The search typically takes around 60 seconds or less.
  • Figure 5: The model's perplexity on the Wikipedia dataset at different compression rates. "+H2O" and "+Pyr." refer to the additional use of H2O and PyramidInfer, respectively, for intra-layer compression.
  • ...and 2 more figures
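
The search described in the Figure 2 caption can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `similarity_fn` callback (standing in for running the model with a candidate strategy and comparing its final hidden state with the original model's), and the rule that skips pairs which would chain shared layers are all hypothetical simplifications. The later layer of each pair reuses the earlier layer's cache, in line with the Figure 3 caption.

```python
import itertools

import numpy as np


def rank_layer_pairs(kv_per_layer):
    """Return layer pairs (i, j) with i < j, most dissimilar first.

    Dissimilarity is the Euclidean distance between flattened KV caches.
    """
    flat = [np.asarray(kv, dtype=np.float64).ravel() for kv in kv_per_layer]
    scored = [
        (float(np.linalg.norm(flat[i] - flat[j])), i, j)
        for i, j in itertools.combinations(range(len(flat)), 2)
    ]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [(i, j) for _, i, j in scored]


def search_sharing_strategy(kv_per_layer, similarity_fn, threshold, target_ratio):
    """Greedily build a map {layer: layer whose KV cache it reuses}.

    similarity_fn(strategy) is a hypothetical callback returning the
    similarity between the final hidden state of the model run with
    `strategy` and that of the unmodified model.
    """
    num_layers = len(kv_per_layer)
    strategy = {}
    for src, dst in rank_layer_pairs(kv_per_layer):
        # Stop once the requested fraction of layers shares a cache.
        if len(strategy) / num_layers >= target_ratio:
            break
        # Simplification (assumption): never chain shared layers.
        if dst in strategy or src in strategy or dst in strategy.values():
            continue
        candidate = {**strategy, dst: src}
        # Keep the replacement only if the output stays close enough.
        if similarity_fn(candidate) > threshold:
            strategy = candidate
    return strategy


if __name__ == "__main__":
    # Toy calibration caches for a 4-layer model (made-up numbers).
    toy_kv = [np.zeros(4), np.ones(4), 2 * np.ones(4), 5 * np.ones(4)]
    # Toy check: pretend the output degrades whenever layer 3 loses its cache.
    toy_similarity = lambda strategy: 0.0 if 3 in strategy else 1.0
    print(search_sharing_strategy(toy_kv, toy_similarity, threshold=0.5, target_ratio=0.25))
```

In a real setting `similarity_fn` would be the expensive step, since each call implies a forward pass over the calibration data; the ranking itself is a single pass over all layer pairs.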