Reconstructing KV Caches with Cross-layer Fusion For Enhanced Transformers

Hongzhan Lin, Zhiqi Bai, Xinmiao Zhang, Sen Yang, Xiang Li, Siran Yang, Yunlong Xu, Jiaheng Liu, Yongchi Zhao, Jiamang Wang, Yuchi Xu, Wenbo Su, Bo Zheng

TL;DR

Long-context Transformer inference is bottlenecked by KV cache memory. The authors identify an asymmetric information flow: top-layer values are best reconstructed from the bottom layer, while top-layer keys benefit from both the bottom and middle layers. Based on this, they propose FusedKV and FusedKV-Lite, which reconstruct top-layer caches via cross-layer fusion. FusedKV uses a learnable fusion of the bottom- and middle-layer caches, while FusedKV-Lite directly reuses the bottom-layer values and the middle-layer keys to minimize I/O; both preserve RoPE through symmetric weight constraints. Across 332M–4B models, these methods halve KV cache memory and achieve lower perplexity than full-cache baselines, with strong scaling, compatibility with GQA and MoE, and effective long-context performance.

Abstract

Transformer decoders have achieved strong results across tasks, but the memory required for the KV cache becomes prohibitive at long sequence lengths. Although cross-layer KV cache sharing (e.g., YOCO, CLA) offers a path to mitigate the KV cache bottleneck, it typically underperforms within-layer methods like GQA. To understand the root cause, we investigate the information flow of keys and values in the top layers. Our preliminary analysis reveals a clear asymmetry: values are predominantly derived from the bottom layer, while keys draw more information from both the bottom and middle layers. Building upon this, we propose FusedKV, whose top-layer KV caches are a learnable fusion of the most informative caches from the bottom and middle layers. This fusion operates directly on post-RoPE keys, preserving relative positional information without the computational cost of re-applying rotary embeddings. To further improve efficiency, we propose FusedKV-Lite, a cross-layer sharing approach in which top-layer KV caches are directly derived from the bottom-layer values and the middle-layer keys. Compared to FusedKV, FusedKV-Lite reduces I/O overhead at the cost of a slight increase in perplexity. In experiments on LLMs ranging from 332M to 4B parameters, our proposed methods reduce cache memory by 50\% while achieving lower validation perplexity than the standard Transformer decoder, establishing them as a memory-efficient, high-performance architectural alternative.
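
To make the 50\% figure concrete, the following is a minimal sketch of the usual KV cache size arithmetic. The model shape (16 layers, 16 KV heads, head dimension 128, 32K tokens, 2-byte elements) is a hypothetical example rather than a configuration from the paper, and the sketch assumes that only the bottom half of the layers stores its own cache while the top half reconstructs theirs.

```python
def kv_cache_bytes(n_cached_layers, n_kv_heads, head_dim, seq_len,
                   batch=1, bytes_per_elem=2):
    """Bytes needed to store keys and values for the layers that keep a cache."""
    # The leading factor 2 accounts for one key tensor and one value tensor per layer.
    return 2 * n_cached_layers * n_kv_heads * head_dim * seq_len * batch * bytes_per_elem


# Hypothetical model shape, chosen only for illustration.
LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN = 16, 16, 128, 32768

# Standard decoder: every layer stores its own keys and values.
full_bytes = kv_cache_bytes(LAYERS, KV_HEADS, HEAD_DIM, SEQ_LEN)

# Assumption: only the bottom LAYERS // 2 layers store caches;
# the top layers rebuild theirs from the stored ones.
shared_bytes = kv_cache_bytes(LAYERS // 2, KV_HEADS, HEAD_DIM, SEQ_LEN)

print(f"full cache:   {full_bytes / 2**30:.1f} GiB")
print(f"shared cache: {shared_bytes / 2**30:.1f} GiB")
print(f"reduction:    {1 - shared_bytes / full_bytes:.0%}")
```

Because the cache grows linearly in the number of cached layers, dropping the top half of the per-layer caches halves the total regardless of the other dimensions.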

Paper Structure

This paper contains 68 sections, 18 equations, 12 figures, 17 tables.

Figures (12)

  • Figure 1: FusedKV and FusedKV-Lite reduce KV cache and prefilling latency by 2x (left) while also achieving superior pretraining loss on a 1.5B model compared to other methods (right). FusedKV converges around 1.26x faster than Vanilla.
  • Figure 2: Fusion weight for reconstructing key (left) and value (right) caches in the top 8 layers of a 16-layer model. The figure reveals a clear asymmetry in key-value caches.
  • Figure 3: Illustration of KV cache strategies. (a) Vanilla: The standard method with a unique KV cache for each layer. (b) FusedKV-Lite: For layers $i>n$, the Key cache is reused from layer $n$, and the Value cache from layer $1$. (c) FusedKV: For layers $i>n$, the caches are a learnable weighted fusion (denoted by $\otimes$) of the caches from layer $1$ and layer $n$.
  • Figure 4: Left: Attention throughput of different kernels (higher is better). Right: Time to First Token (TTFT) of different models, showing end-to-end prefilling performance (lower is better). All methods are normalized by the vanilla (MHA) baseline.
  • Figure 5: Time Per Output Token (TPOT) performance ratios. The left panel displays the memory-bound scenario, and the right panel displays the compute-bound scenario.
  • ...and 7 more figures
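
The two reconstruction strategies in the Figure 3 caption can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' implementation: the fusion is assumed to be a per-channel weighted sum, RoPE is assumed to rotate adjacent channel pairs, and the "symmetric weight constraint" is modelled as each rotary pair sharing one weight. All function names (`rope`, `fuse`, `pair_shared`) are hypothetical.

```python
import numpy as np


def rope(x, positions, base=10000.0):
    """Rotary embedding on x of shape (seq, dim), rotating channel pairs (2i, 2i+1)."""
    dim = x.shape[1]
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)   # one frequency per pair
    angles = positions[:, None] * inv_freq[None, :]    # shape (seq, dim // 2)
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[:, 0::2] = x[:, 0::2] * cos - x[:, 1::2] * sin
    out[:, 1::2] = x[:, 0::2] * sin + x[:, 1::2] * cos
    return out


def fuse(bottom, middle, w_bottom, w_middle):
    """Per-channel weighted sum of two cached tensors (assumed form of the fusion)."""
    return bottom * w_bottom + middle * w_middle


def pair_shared(w_pairs):
    """Expand one weight per rotary pair to both channels of that pair."""
    return np.repeat(w_pairs, 2)


rng = np.random.default_rng(0)
seq, dim = 6, 8
positions = np.arange(seq, dtype=np.float64)

# Pre-RoPE keys and values of layer 1 (bottom) and layer n (middle).
k1_raw, kn_raw = rng.standard_normal((2, seq, dim))
v1, vn = rng.standard_normal((2, seq, dim))
k1, kn = rope(k1_raw, positions), rope(kn_raw, positions)  # what the cache stores

# FusedKV-Lite: layers i > n reuse layer n's keys and layer 1's values as-is.
k_lite, v_lite = kn, v1

# FusedKV: layers i > n use a weighted fusion of the layer-1 and layer-n caches.
a, b = pair_shared(rng.standard_normal(dim // 2)), pair_shared(rng.standard_normal(dim // 2))
k_fused = fuse(k1, kn, a, b)
v_fused = fuse(v1, vn, rng.standard_normal(dim), rng.standard_normal(dim))

# With pair-shared weights, fusing post-RoPE keys equals applying RoPE after fusing.
symmetric_ok = np.allclose(k_fused, rope(fuse(k1_raw, kn_raw, a, b), positions))

# With unconstrained per-channel weights the two orders generally differ.
a_free, b_free = rng.standard_normal((2, dim))
free_ok = np.allclose(fuse(k1, kn, a_free, b_free),
                      rope(fuse(k1_raw, kn_raw, a_free, b_free), positions))
print(symmetric_ok, free_ok)
```

Under these assumptions, scaling both channels of a rotary pair by the same weight commutes with the rotation, which is why the fused post-RoPE key is still a valid rotated key and no second RoPE pass is needed. Values carry no rotary embedding, so their weights are left unconstrained in this sketch.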