Time and Memory Trade-off of KV-Cache Compression in Tensor Transformer Decoding

Yifang Chen, Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song, Yu Tian

TL;DR

This work analyzes the memory and time costs of KV-cache compression in tensor-attention Transformer decoding. Via a reduction from a communication-complexity problem (Index), it derives information-theoretic lower bounds: for $d=\Omega(\log n)$, the four-cache and two-cache tensor attention schemes require $\Omega(nd)$ and $\Omega(n^2 d)$ bits of memory respectively. The two-cache variant pays for its extra memory with faster inference, because it stores the Kronecker products that the four-cache variant must recompute at $O(n^2 d)$ cost (Figure 1). The paper introduces SubGen4Cache and SubGen2Cache to achieve near-optimal space-accuracy trade-offs, and provides covering-number-based guarantees and clusterability properties in both the standard and low-dimensional regimes. The results clarify intrinsic limitations and can guide the development of more memory-efficient tensor attention architectures for large-scale decoding tasks.

Abstract

The key-value (KV) cache in the tensor version of Transformers is a significant bottleneck during inference. Previous work analyzes the fundamental space-complexity barriers in standard attention mechanisms [Haris and Onak, 2025]; our work generalizes these barriers to tensor attention. Our theoretical contributions rely on a reduction from communication complexity, from which we deduce a memory lower bound for tensor-structured attention mechanisms when $d = \Omega(\log n)$. Furthermore, we introduce two types of tensor attention cache and present a trade-off between time and memory for the two scenarios. Overall, our work provides a theoretical foundation for understanding the time-memory trade-off of KV-cache compression in tensor attention decoding and offers further perspectives for developing more memory-efficient tensor attention Transformer architectures.

Paper Structure

This paper contains 46 sections, 23 theorems, 36 equations, 1 figure.

Key Result

Lemma 3.4

If the following conditions hold: Then if $k = \Omega(\epsilon^{-2} \log (f/\delta))$, we have:
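The conditions and conclusion of the lemma are elided in this summary, so the following is only a minimal sketch of the standard Gaussian Johnson-Lindenstrauss phenomenon that such a statement usually describes: a $k \times d$ matrix of i.i.d. Gaussians, scaled by $1/\sqrt{k}$, approximately preserves inner products among $f$ fixed vectors. The sizes `d`, `k`, `f`, the tolerance `eps`, and all function names are illustrative assumptions; the constant hidden in $\Omega(\epsilon^{-2} \log(f/\delta))$ is not specified here.

```python
import math
import random


def gaussian_sketch(rng, k, d):
    """Return a k x d matrix with i.i.d. N(0, 1) entries scaled by 1/sqrt(k)."""
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(d)] for _ in range(k)]


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(x * y for x, y in zip(row, v)) for row in m]


def dot(u, v):
    return sum(x * y for x, y in zip(u, v))


rng = random.Random(0)
d, k, f = 512, 256, 5   # assumed sizes, chosen only for illustration
eps = 0.5               # assumed (loose) distortion tolerance

vectors = [[rng.gauss(0.0, 1.0) for _ in range(d)] for _ in range(f)]
sketch = gaussian_sketch(rng, k, d)
projected = [matvec(sketch, v) for v in vectors]

# Largest inner-product error over all pairs, relative to ||u|| * ||v||.
worst = 0.0
for a in range(f):
    for b in range(a, f):
        err = abs(dot(projected[a], projected[b]) - dot(vectors[a], vectors[b]))
        norm_product = math.sqrt(dot(vectors[a], vectors[a]) * dot(vectors[b], vectors[b]))
        worst = max(worst, err / norm_product)

print("worst relative inner-product error:", worst)
```

Increasing `k` shrinks the typical error at a rate of roughly $1/\sqrt{k}$, which is the source of the $\epsilon^{-2}$ dependence in the bound on $k$.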

Figures (1)

  • Figure 1: Comparison of memory and computational requirements between four-cache and two-cache matrix formulations in tensor attention. The four-cache approach $(K_{1,i},K_{2,i},V_{1,i},V_{2,i})$ uses linear memory $O(id)$ but incurs $O(i^2d)$ computational cost for Kronecker products during inference. The two-cache approach $(\widetilde{K}_i, \widetilde{V}_i)$ pre-computes these products, requiring $O(i^2d)$ memory but enabling faster inference.
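The trade-off in Figure 1 can be illustrated with a small sketch. It assumes a column-wise Kronecker product whose rows are indexed by pairs of rows of the two factors, and a plain softmax over inner products with any scaling factor from the paper's definition omitted; the function names and the sizes `i`, `d` are hypothetical. Both caches produce the same attention output, but they store different numbers of entries.

```python
import math
import random


def colwise_kron(a, b):
    """Column-wise Kronecker product of a (n1 x d) and b (n2 x d).

    The result has n1 * n2 rows; the row for the pair (r1, r2) holds
    a[r1][j] * b[r2][j]. The row ordering is an assumption and only needs
    to be the same for keys and values.
    """
    return [[x * y for x, y in zip(ra, rb)] for ra in a for rb in b]


def attend(q, k_tilde, v_tilde):
    """Softmax attention of a single query over the rows of k_tilde / v_tilde."""
    scores = [sum(x * y for x, y in zip(q, row)) for row in k_tilde]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    width = len(v_tilde[0])
    return [sum(w * row[j] for w, row in zip(weights, v_tilde)) / total
            for j in range(width)]


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-1.0, 1.0) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
i, d = 6, 4  # assumed number of cached tokens and head dimension
k1, k2, v1, v2 = (rand_matrix(rng, i, d) for _ in range(4))
q = [rng.uniform(-1.0, 1.0) for _ in range(d)]

# Four-cache: store the four factors, rebuild both products for every query.
four_cache = (k1, k2, v1, v2)
out_four = attend(q, colwise_kron(k1, k2), colwise_kron(v1, v2))
four_cache_entries = sum(len(m) * len(m[0]) for m in four_cache)  # 4 * i * d

# Two-cache: store the two products once and reuse them for every query.
two_cache = (colwise_kron(k1, k2), colwise_kron(v1, v2))
out_two = attend(q, *two_cache)
two_cache_entries = sum(len(m) * len(m[0]) for m in two_cache)  # 2 * i * i * d

print(four_cache_entries, two_cache_entries)
```

With `i = 6` and `d = 4`, the four-cache layout stores $4id = 96$ entries and the two-cache layout stores $2i^2d = 288$, while the four-cache path redoes the $O(i^2 d)$ product computation on each query.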

Theorems & Definitions (40)

  • Definition 3.1: $\oslash$ Column-wise Kronecker Product
  • Definition 3.2: Softmax
  • Definition 3.3: JL-transform jl84, see Definition 3 in w14_book as an example
  • Lemma 3.4: Theorem 4 in w14_book
  • Lemma 3.5: Johnson-Lindenstrauss (JL) Random Projections, an immediate application of Lemma \ref{lem:jlt_gaussian}, w14_book
  • proof
  • Definition 3.6: Tensor Attention, a variation of as24_iclr
  • Definition 3.7
  • Definition 3.8: Tensor Attention with KV Cache, Two Cache Matrices Case
  • Remark 3.9: Memory-Speed Trade-off Between Two and Four Cache Matrices
  • ...and 30 more