SubGen: Token Generation in Sublinear Time and Memory

Amir Zandieh; Insu Han; Vahab Mirrokni; Amin Karbasi

TL;DR

SubGen addresses the linear growth of memory and compute in autoregressive decoding caused by KV caching in transformers. It presents a streaming attention framework that achieves sublinear memory and time in the context length $n$ by (i) maintaining a sublinear set of key-value samples via reservoir sampling, and (ii) approximating the softmax denominator using a clustering-based data structure under the assumption that the keys are $(m,\delta)$-clusterable with $m = o(n)$. The authors prove that, for $t = O\left(\frac{1}{\varepsilon^2} e^{2\delta r} \log n\right)$ and $s = O\left(\frac{1}{\varepsilon^2} d\right)$, where $\varepsilon$ is the target accuracy and $r$ bounds the query norm $\left\| {\bm q}_n \right\|_2$, the estimator ${\bm z}_n$ satisfies $\left\| {\bm z}_n - \text{Attn}({\bm q}_n, {\bm K}_n, {\bm V}_n) \right\|_2 \le \varepsilon \left\| \text{softmax}({\bm K}_n {\bm q}_n) \right\|_2 \left\| {\bm V}_n \right\|_{op}$ with high probability. Under $m = o(n)$ this yields memory and runtime $O(d (m t + s))$, i.e., sublinear in context length. Empirically, SubGen outperforms prior KV-cache compression methods on long-context QA and line-retrieval tasks, illustrating practical impact for efficient long-context generation.
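To make the two ingredients above concrete, here is a minimal Python sketch of how a softmax denominator could be estimated from per-cluster reservoir samples. It is an illustration under stated assumptions, not the authors' implementation: the names `Reservoir` and `estimate_normalizer` are hypothetical, the reservoir uses the classical Algorithm R, and each cluster's contribution is taken to be its key count times the average of $\exp(\langle {\bm k}, {\bm q} \rangle)$ over its sampled keys.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


class Reservoir:
    """Uniform sample of at most `capacity` stream items (Algorithm R)."""

    def __init__(self, capacity, rng):
        self.capacity = capacity
        self.rng = rng
        self.count = 0   # number of items seen so far
        self.items = []  # current uniform sample

    def add(self, item):
        self.count += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
        else:
            # Keep the new item with probability capacity / count.
            j = self.rng.randrange(self.count)
            if j < self.capacity:
                self.items[j] = item


def estimate_normalizer(clusters, query):
    """Estimate sum_i exp(<k_i, q>) from one reservoir per cluster.

    Assumption: each cluster contributes (number of keys in the cluster)
    times the mean of exp(<k, q>) over its sampled keys.
    """
    total = 0.0
    for res in clusters:
        if res.items:
            mean_exp = sum(math.exp(dot(k, query)) for k in res.items) / len(res.items)
            total += res.count * mean_exp
    return total
```

When a reservoir's capacity is at least the number of keys in its cluster, the sample is the whole cluster and the estimate coincides with the exact denominator; with smaller capacities it becomes a randomized approximation whose quality depends on how tightly the keys cluster.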

Abstract

Despite the significant success of large language models (LLMs), their extensive memory requirements pose challenges for deploying them in long-context token generation. The substantial memory footprint of LLM decoders arises from the necessity to store all previous tokens in the attention module, a requirement imposed by key-value (KV) caching. In this work, our focus is on developing an efficient compression technique for the KV cache. Empirical evidence indicates a significant clustering tendency within key embeddings in the attention module. Building on this key insight, we have devised a novel caching method with sublinear complexity, employing online clustering on key tokens and online $\ell_2$ sampling on values. The result is a provably accurate and efficient attention decoding algorithm, termed SubGen. Not only does this algorithm ensure a sublinear memory footprint and sublinear time complexity, but we also establish a tight error bound for our approach. Empirical evaluations on long-context question-answering tasks demonstrate that SubGen significantly outperforms existing and state-of-the-art KV cache compression methods in terms of performance and efficiency.
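The "online $\ell_2$ sampling on values" mentioned in the abstract can be pictured with the following Python sketch. It is a hedged illustration rather than the paper's code: the class name `L2ValueSampler` and its methods are hypothetical, and the scheme shown is a standard weighted reservoir with independent single-item slots, where each slot holds token $i$ with probability proportional to the squared norm of its value vector. The estimator targets the unnormalized numerator $\sum_i \exp(\langle {\bm k}_i, {\bm q} \rangle)\, {\bm v}_i$.

```python
import math
import random


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


class L2ValueSampler:
    """Keeps `num_slots` independent samples, each drawn with
    probability proportional to the squared l2 norm of the value."""

    def __init__(self, num_slots, rng):
        self.slots = [None] * num_slots
        self.total_sq_norm = 0.0  # running sum of ||v_i||^2
        self.rng = rng

    def update(self, key, value):
        w = sum(x * x for x in value)
        if w == 0.0:
            return  # a zero value vector contributes nothing
        self.total_sq_norm += w
        p = w / self.total_sq_norm  # replacement probability per slot
        for i in range(len(self.slots)):
            if self.rng.random() < p:
                self.slots[i] = (key, value, w)

    def estimate(self, query):
        """Unbiased estimate of sum_i exp(<k_i, q>) * v_i."""
        filled = [slot for slot in self.slots if slot is not None]
        if not filled:
            return []
        out = [0.0] * len(filled[0][1])
        for key, value, w in filled:
            # Importance weight: 1 / (num_slots * sampling probability).
            scale = self.total_sq_norm / (len(self.slots) * w) * math.exp(dot(key, query))
            for j, v_j in enumerate(value):
                out[j] += scale * v_j
        return out
```

Sampling by squared value norm concentrates the budget on tokens whose values can move the output the most, which is the usual motivation for norm-based row sampling in approximate matrix products.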

Paper Structure (16 sections, 4 theorems, 13 equations, 1 figure, 1 table, 1 algorithm)

Introduction
Related Work
Streaming Attention Problem
Overview of Contributions
Sublinear Time and Memory Algorithm
Streaming Attention Data Structure
Matrix Product Data Structure
Softmax Normalizer (Partition Function) DS
Case 1: $\left\| {\bm x}_{i^*} - {\bm k}_{n+1} \right\|_2 \le \delta$.
Case 2: $\left\| {\bm x}_{i^*} - {\bm k}_{n+1} \right\|_2 > \delta$.
Streaming Attention: Main Theorem
Memory and Runtime.
Experiments
Ablation Study on Clusterability
End-to-end Text Generation
...and 1 more section
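The two cases listed in the outline for the softmax normalizer data structure compare a new key ${\bm k}_{n+1}$ against ${\bm x}_{i^*}$ using the threshold $\delta$. A minimal Python sketch of such an online clustering step follows. The outline names only the two conditions, so the actions taken here are assumptions: ${\bm x}_{i^*}$ is read as the nearest existing cluster center, Case 1 increments that cluster's count, and Case 2 opens a new cluster centered at the new key. The function name `update_clusters` is hypothetical.

```python
import math


def update_clusters(centers, counts, new_key, delta):
    """Process one streamed key and return the index of its cluster.

    Case 1: the nearest center is within `delta`, so the key joins it.
    Case 2: otherwise the key becomes the center of a new cluster.
    """
    if centers:
        # Index i* of the nearest existing center.
        i_star = min(range(len(centers)),
                     key=lambda i: math.dist(centers[i], new_key))
        if math.dist(centers[i_star], new_key) <= delta:
            counts[i_star] += 1  # Case 1
            return i_star
    centers.append(list(new_key))  # Case 2 (or the very first key)
    counts.append(1)
    return len(centers) - 1


# Usage: four keys forming two tight groups, with delta = 0.5.
centers, counts = [], []
stream = [[0.0, 0.0], [0.1, 0.0], [5.0, 5.0], [5.0, 5.05]]
assignments = [update_clusters(centers, counts, k, 0.5) for k in stream]
```

Because centers are never moved after creation, the number of stored centers grows only when a key falls outside every existing $\delta$-ball, which is what keeps the structure small when the keys are clusterable.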

Key Result

Lemma 1

For any positive integer $s$, at any iteration $n$ of the stream in Algorithm 1, the following properties are maintained:

Figures (1)

Figure 1: A t-SNE plot of cached key (first row) and value (second row) embeddings over $1024$ timesteps from Llama2-7B on the MT Bench dataset. Layers $\ell \in \{0,7,15,23,31\}$ are shown, with head IDs chosen uniformly at random. Key embeddings are more clusterable than value embeddings. The green dots represent the centers found by the greedy k-center algorithm (Dyer and Frieze, 1985) with $k = 16$.
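For readers unfamiliar with the greedy k-center heuristic referenced in the caption, the following Python sketch shows the usual farthest-first traversal. It is an illustrative version only: the function name `greedy_k_center` is hypothetical, the first center is arbitrarily taken to be the first point, and Euclidean distance is assumed.

```python
import math


def greedy_k_center(points, k):
    """Return indices of up to k centers chosen by farthest-first traversal."""
    if not points or k <= 0:
        return []
    centers = [0]  # assumption: start from the first point
    # dist[i] = distance from point i to its nearest chosen center
    dist = [math.dist(p, points[0]) for p in points]
    while len(centers) < min(k, len(points)):
        # Next center: the point farthest from all current centers.
        nxt = max(range(len(points)), key=dist.__getitem__)
        centers.append(nxt)
        for i, p in enumerate(points):
            dist[i] = min(dist[i], math.dist(p, points[nxt]))
    return centers


# Usage: five points on a line, three centers.
pts = [[0.0], [1.0], [10.0], [11.0], [20.0]]
chosen = greedy_k_center(pts, 3)
```

In this example the traversal starts at index 0, then picks the farthest point (index 4), then the point farthest from both (index 2), so the centers spread out to cover each group of points.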

Theorems & Definitions (8)

Definition 1: Clusterability
Lemma 1: Correctness of UpdateMatrixProduct
proof
Lemma 2: Correctness of UpdateSoftmaxNormalizer
proof
Theorem 1: Efficiency and Correctness of Algorithm 1
proof
Corollary 1
