Algorithm and Hardness for Dynamic Attention Maintenance in Large Language Models

Jan van den Brand; Zhao Song; Tianyi Zhou

TL;DR

This work formalizes the dynamic online diagonal-based normalized attention matrix vector multiplication problem (ODAMV) for LLM-style attention: each update changes one entry of $K$ or $V$, and each query asks for one entry of $D^{-1} \exp(QK^\top) V$. It introduces a lazy-update data structure, combined with a periodic re-computation step that preserves correctness, which achieves $O(n^{\omega(1,1,\tau)-\tau})$ amortized update time and $O(n^{1+\tau})$ worst-case query time using $O(n^2)$ space when $d = O(n)$. A conditional lower bound derived from the hinted matrix vector multiplication (HMV) conjecture shows that the update and query bounds cannot both be improved by a polynomial factor unless the conjecture is false. Together, the upper bound and the HMV-based hardness result characterize the trade-off between update cost and query cost when attention must be maintained as its inputs change.
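To make the problem statement concrete, the following is a minimal sketch of a naive baseline, not the paper's data structure. It stores $Q$, $K$, $V$ as plain lists, applies entry updates directly, and recomputes one row of $\exp(QK^\top)$ per query. The class name `NaiveODAMV`, the method names, and the choice of additive updates (adding `delta` to an entry) are assumptions made for illustration.

```python
import math


class NaiveODAMV:
    """Naive baseline (illustrative only): O(1) updates, O(n*d) per query."""

    def __init__(self, Q, K, V):
        # Copy the n x d input matrices so later updates stay local.
        self.Q = [row[:] for row in Q]
        self.K = [row[:] for row in K]
        self.V = [row[:] for row in V]

    def update_k(self, i, j, delta):
        # Assumption: an update adds delta to a single entry of K.
        self.K[i][j] += delta

    def update_v(self, i, j, delta):
        # Assumption: an update adds delta to a single entry of V.
        self.V[i][j] += delta

    def query(self, i, j):
        # Row i of A = exp(Q K^T), with exp applied entry-wise.
        a_row = [
            math.exp(sum(q * k for q, k in zip(self.Q[i], k_row)))
            for k_row in self.K
        ]
        # D_{i,i} is the i-th entry of A * 1_n, i.e. the row sum.
        d_ii = sum(a_row)
        # (D^{-1} A V)_{i,j} = (sum_r A_{i,r} V_{r,j}) / D_{i,i}
        return sum(a * v_row[j] for a, v_row in zip(a_row, self.V)) / d_ii
```

With $Q$ set to zero every entry of $A$ equals 1, so a query returns the mean of column $j$ of $V$, which gives an easy sanity check. The baseline pays nothing at update time and $O(nd)$ at query time; the data structure summarized above instead balances the two costs.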

Abstract

Large language models (LLMs) have made fundamental changes in human life. The attention scheme is one of the key components of LLMs such as BERT, GPT-1, Transformers, GPT-2, 3, 3.5 and 4. Inspired by previous theoretical studies of the static version of the attention multiplication problem [Zandieh, Han, Daliri, and Karbasi arXiv 2023, Alman and Song arXiv 2023], we formally define a dynamic version of the attention matrix multiplication problem. There are matrices $Q, K, V \in \mathbb{R}^{n \times d}$, which represent the query, key and value in LLMs. In each iteration we update one entry in $K$ or $V$. In the query stage, we receive $(i,j) \in [n] \times [d]$ as input and want to answer $(D^{-1} A V)_{i,j}$, where $A := \exp(QK^\top) \in \mathbb{R}^{n \times n}$ is a square matrix and $D := \mathrm{diag}(A {\bf 1}_n) \in \mathbb{R}^{n \times n}$ is a diagonal matrix. Here $\exp$ is applied entry-wise and ${\bf 1}_n$ denotes the length-$n$ vector whose entries are all ones. We provide two results: an algorithm and a conditional lower bound.

$\bullet$ On one hand, inspired by the lazy update idea from [Demetrescu and Italiano FOCS 2000, Sankowski FOCS 2004, Cohen, Lee and Song STOC 2019, Brand SODA 2020], we provide a data structure that uses $O(n^{\omega(1,1,\tau)-\tau})$ amortized update time and $O(n^{1+\tau})$ worst-case query time.

$\bullet$ On the other hand, we show that unless the hinted matrix vector multiplication conjecture [Brand, Nanongkai and Saranurak FOCS 2019] is false, there is no algorithm that can use both $O(n^{\omega(1,1,\tau)-\tau-\Omega(1)})$ amortized update time and $O(n^{1+\tau-\Omega(1)})$ worst-case query time.

In conclusion, our algorithmic result is conditionally optimal unless the hinted matrix vector multiplication conjecture is false.
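The lazy update idea can be illustrated with a deliberately simplified sketch. It handles updates to $V$ only, keeps $K$ fixed, and uses plain loops rather than fast rectangular matrix multiplication, so it does not reproduce the paper's data structure or its running times. The class name `LazyVAttention`, the `threshold` parameter (a batch size after which the stored product is rebuilt), and the additive update convention are all assumptions made for illustration.

```python
import math


class LazyVAttention:
    """Simplified lazy-update sketch: V updates only, K is assumed fixed."""

    def __init__(self, Q, K, V, threshold):
        self.V = [row[:] for row in V]
        self.threshold = threshold  # hypothetical batch size before a rebuild
        self.pending = []           # buffered (row, col, delta) updates to V
        self.recomputes = 0
        # P = D^{-1} exp(Q K^T); it never changes because K is fixed here.
        self.P = []
        for q_row in Q:
            a_row = [
                math.exp(sum(q * k for q, k in zip(q_row, k_row)))
                for k_row in K
            ]
            d_ii = sum(a_row)
            self.P.append([a / d_ii for a in a_row])
        self._recompute()

    def _recompute(self):
        # Fold the buffered updates into V, then rebuild M = P V from scratch.
        for r, c, delta in self.pending:
            self.V[r][c] += delta
        self.pending = []
        cols = len(self.V[0])
        self.M = [
            [sum(p * self.V[r][c] for r, p in enumerate(p_row)) for c in range(cols)]
            for p_row in self.P
        ]

    def update_v(self, r, c, delta):
        # Lazy step: only record the update; rebuild once the buffer is full.
        self.pending.append((r, c, delta))
        if len(self.pending) >= self.threshold:
            self._recompute()
            self.recomputes += 1

    def query(self, i, j):
        # Stored answer plus a correction for each buffered update in column j:
        # adding delta to V[r][j] changes (P V)[i][j] by P[i][r] * delta.
        correction = sum(
            self.P[i][r] * delta for r, c, delta in self.pending if c == j
        )
        return self.M[i][j] + correction
```

A larger `threshold` makes rebuilds rarer, so their cost is spread over more updates, but each query must scan a longer buffer. This is the same kind of update versus query trade-off that the parameter $\tau$ controls in the bounds above, although the sketch makes no claim about matching them.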

Paper Structure (25 sections, 12 theorems, 37 equations, 2 figures, 4 algorithms)

This paper contains 25 sections, 12 theorems, 37 equations, 2 figures, 4 algorithms.

Introduction
Our Results
Related Work
Static Attention Computation
Transformer Theory
Dynamic Maintenance
Roadmap
Preliminary
Technique Overview
Algorithm
Problem Formulation
Lazy Update
Re-compute
Fast Query
Hardness
...and 10 more sections

Key Result

Theorem 1.3

Let $a \in (0,1]$ be any constant and let $d = O(n)$. There is a dynamic data structure that uses $O(n^2)$ space and supports the following operations:

Figures (2)

Figure 1: Computation of the attention matrix $A = \exp(QK^\top)$ and the diagonal matrix $D \in \mathbb{R}^{n \times n}$ (defined in Definition 1.1). Here $\exp(\cdot)$ is the entry-wise function.
Figure 2: Computation of the target matrix $\mathsf{Att}(Q,K,V) = D^{-1} A V$ (defined in Definition 1.1).

Theorems & Definitions (40)

Definition 1.1: Static Attention Multiplication
Definition 1.2: $\mathsf{ODAMV}(n,d)$
Theorem 1.3: Upper bound, informal version of Theorem 4.1
Lemma 1.4: Lower bound, informal version of Lemma \ref{lem:lowerbound_D}
Definition 2.2
Definition 2.3
Conjecture 3.1: Hinted MV ($\mathsf{HMV}$) [Brand, Nanongkai and Saranurak FOCS 2019]
Theorem 4.1: Main algorithm, formal version of Theorem 1.3
Remark 4.2
Lemma 4.3: Init
...and 30 more
