Time and Memory Trade-off of KV-Cache Compression in Tensor Transformer Decoding
Yifang Chen, Xiaoyu Li, Yingyu Liang, Zhenmei Shi, Zhao Song, Yu Tian
TL;DR
This work analyzes the memory and time costs of KV-cache compression in tensor-attention Transformer decoding. By a reduction from a communication-complexity problem (Index), it derives information-theoretic lower bounds: for $d=\Omega(\log n)$, the four-cache and two-cache tensor attention schemes require $\Omega(nd)$ and $\Omega(n^2 d)$ bits of memory respectively, with the two-cache variant offering faster computation by $\Omega(n^2 d)$. It introduces SubGen4Cache and SubGen2Cache to achieve near-optimal space-accuracy trade-offs, and provides covering-number-based guarantees and clusterability properties in both the standard and low-dimensional regimes. The results illuminate intrinsic limitations and guide the development of more memory-efficient tensor attention architectures for large-scale decoding tasks.
Abstract
The key-value (KV) cache in the tensor version of Transformers is a significant bottleneck during inference. While previous work analyzes the fundamental space complexity barriers in standard attention mechanisms [Haris and Onak, 2025], our work generalizes these space complexity barriers to tensor attention. Our theoretical contributions rely on a reduction from communication complexity, from which we deduce a memory lower bound for tensor-structured attention mechanisms when $d = \Omega(\log n)$. Furthermore, we introduce two types of tensor attention cache and present a trade-off between time and memory in the two scenarios. Overall, our work provides a theoretical foundation for understanding the time-memory trade-off of KV-cache compression in tensor attention decoding and offers further perspectives for developing more memory-efficient tensor attention Transformer architectures.
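
To make the time-memory trade-off between the two cache types concrete, the following minimal sketch counts stored entries for each. It rests on assumptions not stated in this abstract: that the four-cache variant keeps four factor matrices $K_1, K_2, V_1, V_2 \in \mathbb{R}^{n \times d}$, and that the two-cache variant keeps the two column-wise Kronecker products of size $n^2 \times d$, as in the usual tensor-attention formulation. The function names are hypothetical, and the counts are raw storage sizes, not the lower bounds themselves.

```python
def four_cache_entries(n, d):
    # Assumption: four factor matrices K1, K2, V1, V2, each of shape (n, d).
    return 4 * n * d


def two_cache_entries(n, d):
    # Assumption: two precomputed products K and V, each of shape (n*n, d).
    return 2 * n * n * d


def columnwise_kron(a, b):
    """Column-wise Kronecker product of two (n, d) matrices given as lists.

    The result has n*n rows; row i*n + j is the elementwise product
    of a[i] and b[j]. This is the precomputation that the two-cache
    variant is assumed to store instead of recomputing it per step.
    """
    return [[x * y for x, y in zip(row_a, row_b)] for row_a in a for row_b in b]


if __name__ == "__main__":
    n, d = 1024, 64
    print("four-cache entries:", four_cache_entries(n, d))
    print("two-cache entries: ", two_cache_entries(n, d))
    # The ratio grows linearly in n: (2 * n^2 * d) / (4 * n * d) = n / 2.
    print("ratio:", two_cache_entries(n, d) / four_cache_entries(n, d))
```

Under these assumptions, the two-cache variant pays a factor of $n/2$ more storage in exchange for not rebuilding the $n^2 \times d$ products at decoding time, which is the trade-off the abstract refers to.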
