One Pass Streaming Algorithm for Super Long Token Attention Approximation in Sublinear Space

Raghav Addanki; Chenyang Li; Zhao Song; Chiwun Yang

TL;DR

This work tackles the memory bottleneck of attention with super-long contexts by introducing a one-pass streaming algorithm that uses sublinear space $o(n)$ for self-attention with context length $n$ when the feature dimension is $d=O(\log n)$. It combines a polynomial-method low-rank factorization $U_1U_2^\top$ with sketching ($\Phi$, $\Psi$) and sparse recovery to produce a $k$-sparse approximation to each column of the attention output, achieving the error bound $\| T_i - y_i \|_2 \le (1+\varepsilon_1) \min_{k\text{-sparse } y'} \| y' - y_i \|_2 + \varepsilon_2$ with high probability. The main result formalizes the space and time guarantees, showing decoding time $O(\varepsilon_1^{-1} k n^{o(1)})$ and success probability $0.99$ while reading the data in a single pass. The approach enables memory-efficient deployment of transformer-based models on streaming long-text inputs and provides a pathway toward scalable, long-context processing in practical LLM systems and AGI research.

Abstract

Attention computation requires $O(n^2)$ time and $O(n^2)$ space, so deploying Large Language Models (LLMs) in streaming applications with long contexts demands substantial computational resources. At the recent OpenAI DevDay (Nov 6, 2023), OpenAI released a new model that supports a 128K-long document. In this paper, we focus on memory efficiency when the context length $n$ is much greater than 128K ($n \gg 2^d$). Consider single-layer self-attention with Query, Key, and Value matrices $Q, K, V \in \mathbb{R}^{n \times d}$. The polynomial method approximates the attention output $T \in \mathbb{R}^{n \times d}$ by constructing $U_1, U_2 \in \mathbb{R}^{n \times t}$, which speeds up the computation of ${\sf Attn}(Q, K, V)$ to $n^{1+o(1)}$ time. However, computing the approximated attention matrix $U_1U_2^\top \in \mathbb{R}^{n \times n}$ still requires $O(n^2)$ space, leading to significant memory usage. In response to these challenges, we introduce a new algorithm that reads the data in only one pass, in a streaming fashion. This method uses sublinear space $o(n)$ to store three sketch matrices, removing the need to store $K, V$ exactly. Notably, our algorithm is especially memory-efficient for super-long token sequences: as the token length $n$ increases, our error bound decreases while the memory usage remains nearly constant. This attribute underscores the potential of our technique for efficiently handling LLMs in streaming applications.
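The memory argument in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: `U1` and `U2` are random stand-ins (the polynomial-method construction is not given here), the softmax normalization is omitted, and the function name is hypothetical. It only shows that the product $(U_1U_2^\top)V$ can be formed without the $n \times n$ matrix, by accumulating the small $t \times d$ state $U_2^\top V$ one row at a time.

```python
import numpy as np

def streaming_low_rank_state(u2_rows, v_rows):
    """Accumulate S = U2^T V from a stream of (u2_row, v_row) pairs.

    Only a t x d matrix is kept in memory, never the n x n product.
    """
    state = None
    for u2_row, v_row in zip(u2_rows, v_rows):
        update = np.outer(u2_row, v_row)  # t x d rank-one contribution
        state = update if state is None else state + update
    return state

rng = np.random.default_rng(0)
n, t, d = 512, 8, 4
# Assumption: random matrices stand in for the polynomial-method factors.
U1 = rng.standard_normal((n, t))
U2 = rng.standard_normal((n, t))
V = rng.standard_normal((n, d))

S = streaming_low_rank_state(U2, V)   # t x d state built in one pass
out_stream = U1 @ S                   # n x d, no n x n intermediate
out_dense = (U1 @ U2.T) @ V           # reference that materializes n x n
max_gap = np.max(np.abs(out_stream - out_dense))
```

The two outputs agree up to floating-point error. Note that `out_stream` still has $n$ rows, so this sketch alone does not reach sublinear space; the summary above attributes that step to sketching and sparse recovery.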

Paper Structure (15 sections, 7 theorems, 17 equations, 1 figure)

Introduction
Our Result
Related Work
Attention Theory
Streaming Algorithm
Improving LLMs’ Utilization of Long Text
Preliminary
Notations.
Polynomial Method
Sketching Matrices
Approximate Matrix Product
Sparse Recovery
Analysis
A General Result
Conclusion

Key Result

Theorem 1.3

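There is a one-pass streaming algorithm (Algorithm alg:multiple) that reads $Q,K,V \in \mathbb{R}^{n \times d}$ with $d= O(\log n)$, uses $O( \epsilon_1^{-1} k n^{o(1)} + \epsilon_2^{-2} n^{o(1)})$ space, and outputs a matrix $T \in \mathbb{R}^{n \times d}$ such that, for each column $T_i$ of $T$, $\| T_i - y_i \|_2 \le (1+\epsilon_1) \min_{k\text{-sparse } y'} \| y' - y_i \|_2 + \epsilon_2$, where $y_i$ is the corresponding column of the exact attention output. The algorithm succeeds with probability $0.99$ and has decoding time $O(\epsilon_1^{-1} k n^{o(1)})$.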
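The benchmark term $\min_{k\text{-sparse } y'} \| y' - y_i \|_2$ in the error bound has a simple closed form: the best $k$-sparse approximation keeps the $k$ largest-magnitude entries of $y_i$ and zeroes the rest. The sketch below illustrates that fact and checks whether a candidate column meets a bound of the stated shape. The function names and the toy vector are hypothetical, and the check needs the exact column in memory, so it is a diagnostic and not part of any streaming procedure.

```python
import numpy as np

def best_k_sparse(y, k):
    """Return the k-sparse vector closest to y in the l2 norm (top-k by magnitude)."""
    y = np.asarray(y, dtype=float)
    out = np.zeros_like(y)
    if k <= 0:
        return out
    k = min(k, y.size)
    # Indices of the k largest |y_j|; argpartition avoids a full sort.
    keep = np.argpartition(np.abs(y), y.size - k)[y.size - k:]
    out[keep] = y[keep]
    return out

def satisfies_guarantee(t_col, y, k, eps1, eps2):
    """Check ||t_col - y||_2 <= (1 + eps1) * min_{k-sparse y'} ||y' - y||_2 + eps2."""
    y = np.asarray(y, dtype=float)
    tail = np.linalg.norm(best_k_sparse(y, k) - y)
    return np.linalg.norm(np.asarray(t_col, dtype=float) - y) <= (1 + eps1) * tail + eps2

# Toy column with two dominant entries (illustrative values only).
y = np.array([5.0, -0.1, 0.2, -4.0, 0.05])
y_top2 = best_k_sparse(y, 2)
```

When a column is dominated by a few large entries, the tail norm is small, so a bound of this form is close to an absolute error bound of $\epsilon_2$.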

Figures (1)

Figure 1: Comparison between our method and previous works. Left: vanilla attention computation [VSP+17]. Middle: fast attention by the polynomial method [AS23]. Right: one-pass algorithm (ours).

Theorems & Definitions (15)

Definition 1.1: Static Attention Approximation without Space Requirement [AS23]
Definition 1.2: Streaming Attention Approximation with Sublinear in $n$ Space
Theorem 1.3: Main Result (informal version of the formal theorem, thm:formal)
Lemma 3.1: Error Approximation (Lemma 3.6 in [AS23])
Definition 3.2: $k$-wise independence
Definition 3.3: Random Gaussian matrix
Definition 3.4: AMS sketch matrix [AMS99]
Lemma 3.5: Johnson–Lindenstrauss lemma (folklore, [JL84])
Lemma 3.6
proof
...and 5 more
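The random Gaussian matrix and Johnson–Lindenstrauss entries in the list above can be illustrated with a small sketch of how a linear sketch $\Phi x$ is maintained in one pass. This is an assumption-laden illustration, not the paper's construction: the function name is hypothetical, and each column of $\Phi$ is regenerated from a seeded pseudorandom generator as a stand-in for the limited-independence constructions the paper lists, so that $\Phi$ itself is never stored.

```python
import numpy as np

def stream_gaussian_sketch(entries, m, seed=0):
    """Maintain sketch = Phi @ x while the coordinates of x arrive one at a time.

    Phi has i.i.d. N(0, 1/m) entries. Column j is regenerated from (seed, j),
    so memory use is O(m) regardless of the length of the stream.
    """
    sketch = np.zeros(m)
    for j, x_j in enumerate(entries):
        col = np.random.default_rng([seed, j]).standard_normal(m) / np.sqrt(m)
        sketch += x_j * col  # linear update: add x_j times column j of Phi
    return sketch

rng = np.random.default_rng(1)
n, m = 2000, 1000
x = rng.standard_normal(n)

sketch = stream_gaussian_sketch(x, m)
# Johnson-Lindenstrauss: the squared norm is preserved up to a small factor.
ratio = np.linalg.norm(sketch) ** 2 / np.linalg.norm(x) ** 2
```

With these sizes the ratio concentrates near 1, with typical deviation on the order of $\sqrt{2/m}$. Because the update is linear, the same routine handles streams that revisit a coordinate with additive updates, provided the coordinate index is passed explicitly.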
