
ATP: Enabling Fast LLM Serving via Attention on Top Principal Keys

Yue Niu, Saurav Prakash, Salman Avestimehr

TL;DR

ATP achieves accuracy comparable to the standard attention mechanism with much lower computation and memory complexity. It also reduces complexity for other linear layers with low-rank inputs, yielding more speedup than prior works that target only the attention module.

Abstract

We propose ATP, a new attention mechanism with linear complexity that fixates Attention on Top Principal keys rather than on each individual token. ATP is driven by the observation that input sequences are typically low-rank, i.e., they can be represented by a few principal bases. Therefore, instead of directly iterating over all input tokens, ATP transforms the inputs into an orthogonal space and computes attention only on the top principal bases (keys). Owing to the observed low-rank structure, ATP can capture the semantic relationships in an input sequence with a few principal keys, and the attention complexity is reduced from quadratic to linear without a noticeable performance drop. ATP also reduces complexity for other linear layers with low-rank inputs, yielding more speedup than prior works that target only the attention module. Our evaluations on various models (e.g., BERT and Llama) demonstrate that ATP achieves comparable accuracy with much lower computation and memory complexity than the standard attention mechanism. In particular, ATP barely loses accuracy with only $1/2$ of the principal keys, and incurs only around a $2\%$ accuracy drop with $1/4$ of the principal keys.
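The idea in the abstract can be illustrated with a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' implementation: the function name `low_rank_attention`, the single-head layout, the random projection weights, and the choice of taking the principal components as the top-$r$ singular values times the right singular vectors are all assumptions made here. The point it shows is that the score matrix has shape $(L, r)$ rather than $(L, L)$.

```python
import numpy as np

def softmax(scores, axis=-1):
    # Subtract the row maximum for numerical stability.
    shifted = scores - scores.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def low_rank_attention(x, w_q, w_k, w_v, r):
    """Hypothetical single-head attention on the top-r principal keys.

    x: (L, d) input sequence; w_q, w_k, w_v: (d, d) projection weights.
    """
    # Thin SVD of the input: x = U @ diag(s) @ Vt.
    _, s, vt = np.linalg.svd(x, full_matrices=False)
    # Assumed form of the principal components: top-r rows of diag(s) @ Vt.
    x_principal = s[:r, None] * vt[:r]           # (r, d)
    q = x @ w_q                                  # (L, d): one query per token
    k = x_principal @ w_k                        # (r, d): r principal keys
    v = x_principal @ w_v                        # (r, d): r principal values
    scores = q @ k.T / np.sqrt(q.shape[-1])      # (L, r) instead of (L, L)
    return softmax(scores) @ v, scores.shape

rng = np.random.default_rng(0)
L, d, r = 64, 16, 8
x = rng.standard_normal((L, d))
w_q, w_k, w_v = (rng.standard_normal((d, d)) for _ in range(3))
out, score_shape = low_rank_attention(x, w_q, w_k, w_v, r)
```

For a fixed $r$, the score matrix grows linearly with the sequence length $L$, which is the source of the linear attention complexity described above. The sketch does not account for the cost of the SVD itself.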

Paper Structure

This paper contains 18 sections, 9 equations, 7 figures, 8 tables, 1 algorithm.

Figures (7)

  • Figure 1: Distribution of the low-rankness of Llama-2's embeddings on the MMLU and BoolQ datasets, measured by the ratio $\left \lceil 2^{\mu} \right \rceil / L$. Almost all sequences can be sufficiently approximated with fewer than half of the principal components without incurring error. Longer sequences exhibit a more pronounced low-rank structure.
  • Figure 2: Standard self-attention and low-rank self-attention. Low-rank self-attention shares the same procedure as standard self-attention, but uses only $r$ principal keys and values.
  • Figure 3: Transformer encoder/decoder with low-rank self-attention. Input $X$ is first decomposed by SVD to obtain the principal components, $X'$. Then, $X'$ is fed to an encoder/decoder layer with low-rank self-attention.
  • Figure 4: Actual running time of low-rank self-attention compared to the standard mechanism at different sequence lengths ($r = 128$). The running time of standard self-attention increases quadratically with the sequence length. Low-rank self-attention reduces the running time to almost linear.
  • Figure 5: Energy ratio ($\left \| X' \right \|^2_F / \left \| X \right \|^2_F$) in low-rank hidden representations. Embeddings of all three datasets exhibit highly low-rank structures, with $1/2$ principal components preserving almost all energy.
  • ...and 2 more figures
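
The energy ratio in Figure 5, $\left \| X' \right \|^2_F / \left \| X \right \|^2_F$, can be computed from singular values alone. The sketch below assumes that $X'$ keeps the top-$r$ principal components of $X$, so that its squared Frobenius norm is the sum of the top-$r$ squared singular values. The helper name `energy_ratio` and the synthetic near-rank-4 input are illustrative assumptions, not data from the paper.

```python
import numpy as np

def energy_ratio(x, r):
    """Fraction of the squared Frobenius norm of x kept by its top-r
    principal components (hypothetical helper for illustration)."""
    # Singular values are returned in descending order.
    s = np.linalg.svd(x, compute_uv=False)
    return float(np.sum(s[:r] ** 2) / np.sum(s ** 2))

# Synthetic input: a rank-4 matrix plus small noise (not real embeddings).
rng = np.random.default_rng(0)
L, d, true_rank = 64, 32, 4
x = rng.standard_normal((L, true_rank)) @ rng.standard_normal((true_rank, d))
x += 1e-3 * rng.standard_normal((L, d))

ratio_half = energy_ratio(x, d // 2)   # keep half of the components
ratio_full = energy_ratio(x, d)        # keep all components
```

A ratio close to 1 at small $r$ indicates that the sequence is well approximated by a few principal components, which is the property the figure reports for the hidden representations.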

Theorems & Definitions (1)

  • Remark 4.1