Compression Barriers for Autoregressive Transformers
Themistoklis Haris, Krzysztof Onak
TL;DR
This work proves that any algorithm for autoregressive Transformer token generation needs $\Omega(d\min\{n,e^{d}\})$ bits of space in the worst case, so sublinear space is impossible when the embedding dimension $d$ is not sub-logarithmic in the token count $n$. It uses a reduction from the Index problem together with Johnson-Lindenstrauss projections to establish a tight lower bound, and shows via covering-number arguments that in the complementary low-dimensional regime sublinear space is achievable with the SubGen approach. The paper also analyzes sparsity-based strategies: unstructured sparsity alone is insufficient, but a sliding-window KV cache with a novel estimator achieves $O(dW)$ space, where $W$ is the window size, with a matching $\Omega(dW)$ lower bound, so sublinear space is possible under specific structural patterns. Additionally, it derives an $\Omega(nd)$ time lower bound for non-adaptive token generation algorithms, while noting that adaptive methods can circumvent this barrier. Overall, the results delineate the fundamental limits of KV-cache compression and highlight the need for data-driven structure or distributional assumptions to obtain practical space savings.
Abstract
A key limitation of autoregressive Transformers is the large memory needed at inference time to cache all previous key-value (KV) embeddings. Prior works address this by compressing the KV cache, but often assume specific structural properties of the embeddings. This raises the following natural question: can truly sublinear space utilization be achieved without such assumptions? In this work, we answer this question in the negative. Any algorithm for attention-based token generation must use $\Theta(nd)$ space, where $n$ is the number of tokens generated so far and $d = \Omega(\log n)$ is the dimension of the KV embeddings. Our proof involves a reduction from a classic communication complexity problem and uses a randomized construction that leverages properties of projections in the spirit of the Johnson-Lindenstrauss lemma. For the low-dimensional regime $d = o(\log n)$, we show that any algorithm requires $\Omega(d\cdot e^d)$ space and prove, using tight bounds on covering numbers, that SubGen, proposed by Zandieh, Han, Mirrokni and Karbasi, matches this bound. Further, we investigate how sparsity assumptions enable token generation in truly sublinear space, presenting impossibility results and proposing a new KV cache compression algorithm for sliding window attention when the value cache outside the window is unmasked. Finally, we analyze the time complexity of token generation, using an indistinguishability argument to prove that no non-adaptive algorithm can compute attention online in sublinear time for all tokens.
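To make the quantities in the abstract concrete, the following is a minimal Python sketch of the uncompressed baseline: a KV cache that stores every key and value, so its size grows as $2nd$ numbers after $n$ tokens. The names `KVCache`, `attend`, and `last_window` are hypothetical and chosen only for illustration, and the softmax omits any scaling factor. The `last_window` helper is plain truncation to the most recent $W$ entries, shown only to illustrate an $O(dW)$ footprint; it is not the estimator proposed in the paper.

```python
import math


class KVCache:
    """Uncompressed KV cache: keeps every key and value vector seen so far."""

    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        # One key and one value of dimension d are stored per generated token.
        self.keys.append(list(key))
        self.values.append(list(value))

    def num_floats(self):
        # Total stored numbers: 2 * n * d for n tokens of dimension d.
        return sum(len(k) for k in self.keys) + sum(len(v) for v in self.values)


def attend(query, keys, values):
    """Softmax attention output for a single query (no scaling factor; an assumption)."""
    scores = [sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)  # subtract the max for numerical stability
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    # The output is a convex combination of the cached value vectors.
    return [sum(w * v[j] for w, v in zip(weights, values)) / total
            for j in range(dim)]


def last_window(cache, window):
    """Plain truncation to the most recent `window` entries (illustration only)."""
    return cache.keys[-window:], cache.values[-window:]
```

With $n = 5$ tokens of dimension $d = 3$, `num_floats()` returns $2 \cdot 5 \cdot 3 = 30$, whereas a window of $W = 2$ leaves $2 \cdot 2 \cdot 3 = 12$ numbers. The lower bounds in the abstract say that, without structural assumptions, no exact or approximate scheme can do asymptotically better than the full cache when $d = \Omega(\log n)$.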
