Algorithm and Hardness for Dynamic Attention Maintenance in Large Language Models

Jan van den Brand, Zhao Song, Tianyi Zhou

TL;DR

This work formalizes the dynamic online diagonal-based normalized attention matrix vector multiplication problem (ODAMV) for LLM-style attention, where each update modifies one entry of $K$ or $V$ and each query returns one entry of the diagonal-normalized, exponentiated attention matrix times $V$. It introduces a lazy-update data structure with $O(n^{\omega(1,1,\tau)-\tau})$ amortized update time and $O(n^{1+\tau})$ worst-case query time, together with a recomputation scheme that preserves correctness. A conditional lower bound, derived from the hinted matrix vector multiplication (HMV) conjecture, shows that no algorithm can improve both bounds by a polynomial factor at once unless that conjecture is false. Overall, the paper advances the understanding of dynamic attention maintenance by providing both a concrete upper-bound data structure and an HMV-based hardness result. These results illuminate the computational trade-offs in maintaining attention computations as model inputs and parameters evolve during training or inference.

Abstract

Large language models (LLMs) have fundamentally changed human life. The attention scheme is one of the key components of LLMs such as BERT, GPT-1, Transformers, GPT-2, 3, 3.5 and 4. Inspired by previous theoretical studies of the static version of the attention multiplication problem [Zandieh, Han, Daliri, and Karbasi arXiv 2023, Alman and Song arXiv 2023], we formally define a dynamic version of the attention matrix multiplication problem. There are matrices $Q, K, V \in \mathbb{R}^{n \times d}$, which represent the query, key and value in LLMs. In each iteration we update one entry of $K$ or $V$. In the query stage, we receive $(i,j) \in [n] \times [d]$ as input and want to answer $(D^{-1} A V)_{i,j}$, where $A := \exp(QK^\top) \in \mathbb{R}^{n \times n}$ is a square matrix and $D := \mathrm{diag}(A {\bf 1}_n) \in \mathbb{R}^{n \times n}$ is a diagonal matrix. Here ${\bf 1}_n$ denotes the length-$n$ vector whose entries are all ones. We provide two results: an algorithm and a conditional lower bound. On the one hand, inspired by the lazy update idea from [Demetrescu and Italiano FOCS 2000, Sankowski FOCS 2004, Cohen, Lee and Song STOC 2019, Brand SODA 2020], we provide a data structure that uses $O(n^{\omega(1,1,\tau)-\tau})$ amortized update time and $O(n^{1+\tau})$ worst-case query time. On the other hand, we show that unless the hinted matrix vector multiplication conjecture [Brand, Nanongkai and Saranurak FOCS 2019] is false, there is no algorithm that can use both $O(n^{\omega(1,1,\tau)-\tau-\Omega(1)})$ amortized update time and $O(n^{1+\tau-\Omega(1)})$ worst-case query time. In conclusion, our algorithmic result is conditionally optimal unless the hinted matrix vector multiplication conjecture is false.
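
The update and query interface described above, together with the general lazy-update idea, can be illustrated with a minimal pure-Python sketch. It is not the paper's data structure and does not achieve the stated bounds (it uses no fast matrix multiplication). The class name `LazyAttention`, the `threshold` parameter, and the choice that an update overwrites an entry are assumptions made for illustration. The sketch keeps a snapshot of $A V$ and of the row sums $A {\bf 1}_n$, records which rows of $K$ or $V$ have changed since the snapshot, corrects for those rows at query time, and rebuilds the snapshot once too many rows are stale.

```python
import math


def dot(u, v):
    """Inner product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))


class LazyAttention:
    """Hypothetical lazy-update sketch for querying (D^{-1} A V)[i][j]."""

    def __init__(self, Q, K, V, threshold):
        self.Q = [row[:] for row in Q]
        self.K = [row[:] for row in K]
        self.V = [row[:] for row in V]
        self.threshold = threshold  # assumed knob: max stale rows before rebuild
        self._rebuild()

    def _rebuild(self):
        # Recompute the snapshot from the current K and V.
        n, d = len(self.Q), len(self.V[0])
        self.A_old = [[math.exp(dot(q, k)) for k in self.K] for q in self.Q]
        self.V_old = [row[:] for row in self.V]
        self.row_sum = [sum(row) for row in self.A_old]  # diagonal of D
        self.AV = [
            [sum(self.A_old[i][r] * self.V[r][j] for r in range(n)) for j in range(d)]
            for i in range(n)
        ]
        self.touched = set()  # rows r whose K[r] or V[r] differ from the snapshot

    def _touch(self, r):
        self.touched.add(r)
        if len(self.touched) > self.threshold:
            self._rebuild()

    def update_k(self, r, c, value):
        # Changing K[r][c] only changes column r of A = exp(Q K^T).
        self.K[r][c] = value
        self._touch(r)

    def update_v(self, r, c, value):
        self.V[r][c] = value
        self._touch(r)

    def query(self, i, j):
        # Start from the snapshot and swap in fresh terms for stale rows only.
        num, den = self.AV[i][j], self.row_sum[i]
        for r in self.touched:
            a_new = math.exp(dot(self.Q[i], self.K[r]))
            num += a_new * self.V[r][j] - self.A_old[i][r] * self.V_old[r][j]
            den += a_new - self.A_old[i][r]
        return num / den
```

In this sketch a query costs time proportional to $d$ times the number of stale rows, and a rebuild costs $O(n^2 d)$, so `threshold` trades update cost against query cost in the same spirit as the parameter $\tau$ above.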

Paper Structure (25 sections, 12 theorems, 37 equations, 2 figures, 4 algorithms)

Key Result

Theorem 1.3

For any constant $a \in (0,1]$ and $d = O(n)$, there is a dynamic data structure that uses $O(n^2)$ space and supports the following operations:

Figures (2)

  • Figure 1: Computation of the attention matrix $A = \exp(Q K^\top)$ and the diagonal matrix $D \in \mathbb{R}^{n \times n}$ (defined in Definition 1.1). Here $\exp(\cdot)$ is applied entry-wise.
  • Figure 2: Computation of the target matrix $\mathsf{Att}(Q,K,V) = D^{-1} A V$ (defined in Definition 1.1).
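
The two figure captions describe a two-step static computation, which can be written out as a short pure-Python sketch. The function name `static_attention` is an assumption for illustration, and the code is a direct $O(n^2 d)$ evaluation of the definition rather than any of the paper's algorithms.

```python
import math


def static_attention(Q, K, V):
    """Return Att(Q, K, V) = D^{-1} A V as a list of n rows of length d."""
    n = len(Q)
    d = len(V[0])
    # Step 1 (Figure 1): A = exp(Q K^T) entry-wise, and D = diag(A 1_n).
    A = [
        [math.exp(sum(q * k for q, k in zip(Q[i], K[r]))) for r in range(n)]
        for i in range(n)
    ]
    D = [sum(row) for row in A]  # only the diagonal of D is stored
    # Step 2 (Figure 2): multiply by V and divide row i by D[i].
    return [
        [sum(A[i][r] * V[r][j] for r in range(n)) / D[i] for j in range(d)]
        for i in range(n)
    ]
```

Because each row of $D^{-1} A$ is positive and sums to one, every output row is a weighted average of the rows of $V$. When $Q$ is all zeros, every entry of $A$ equals one and each output row is the plain column-wise mean of $V$.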

Theorems & Definitions (40)

  • Definition 1.1: Static Attention Multiplication
  • Definition 1.2: $\mathsf{ODAMV}(n,d)$
  • Theorem 1.3: Upper bound, informal version of Theorem 4.1
  • Lemma 1.4: Lower bound, informal version of Lemma \ref{lem:lowerbound_D}
  • Definition 2.2
  • Definition 2.3
  • Conjecture 3.1: Hinted MV ($\mathsf{HMV}$) [Brand, Nanongkai and Saranurak FOCS 2019]
  • Theorem 4.1: Main algorithm, formal version of Theorem 1.3
  • Remark 4.2
  • Lemma 4.3: Init
  • ...and 30 more