Attention Mechanisms Through the Lens of Numerical Methods: Approximation Methods and Alternative Formulations

Michel Fabrice Serret, Alice Cortinovis, Yijun Dong, Diana Halikias, Anna Ma, Fabio Matti, Deanna Needell, Katherine J. Pearce, Elizaveta Rebrova, Disha Shur, Rudi Smith, Hai-Xiao Wang, Laura Grigori

Abstract

The attention mechanism is the computational core of modern Transformer architectures, but its quadratic complexity in the input sequence length is the bottleneck for large-scale inference. This has motivated a rapidly growing body of work aimed at accelerating attention through approximation and reformulation. In this survey, we revisit attention mechanisms through the lens of numerical analysis, with a particular emphasis on tools and perspectives from numerical linear algebra. Our goal is twofold: first, we aim to systematically review and classify fast approximation methods according to the numerical principles they exploit. These include sparsity and clustering approaches, low-rank and subspace projection techniques, randomized sketching methods, and tensor-based decompositions. We also discuss kernel-inspired reformulations of attention and recent architectural variants, such as Latent Attention, that modify the standard softmax formulation to improve efficiency. Second, by presenting these developments within a unified mathematical framework, we aim to bridge the gap between disciplines and highlight opportunities for further contributions from computational mathematics, particularly numerical linear algebra, to the design of scalable attention mechanisms.
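
As a point of reference for the discussion, the following is a minimal NumPy sketch of standard scaled dot-product (softmax) attention; the explicit $n \times n$ score matrix it forms is the source of the quadratic cost in the sequence length that motivates the approximation methods surveyed here. The function and variable names are illustrative only and are not tied to any specific implementation reviewed below.

```python
import numpy as np

def softmax_attention(Q, K, V):
    """Standard scaled dot-product attention.

    Q, K: (n, d) query/key matrices; V: (n, d_v) value matrix.
    Forms the full n x n score matrix, which is what makes the
    cost quadratic in the sequence length n.
    """
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)                   # (n, n): O(n^2 d) work, O(n^2) memory
    scores -= scores.max(axis=1, keepdims=True)     # shift for numerical stability
    A = np.exp(scores)
    A /= A.sum(axis=1, keepdims=True)               # row-wise softmax
    return A @ V                                    # (n, d_v)

# Toy example: 8 tokens, head dimension 4 (random placeholder data)
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((8, 4)) for _ in range(3))
out = softmax_attention(Q, K, V)
```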

Paper Structure

This paper contains 71 sections, 128 equations, 14 figures, 1 table.

Figures (14)

  • Figure 1: Example of a sentence embedding produced by a word-based tokenizer for the phrase 'the lazy dog'.
  • Figure 2: Example of an attention layer applied to the embeddings obtained from the tokenization example in Figure 1. Note that in this example the kernel is taken to be linear for simplicity.
  • Figure 3: GPT architecture (Radford et al., 2018): each layer of the GPT Decoder contains multiple submodules, only one of which is a masked Multi-Headed Attention submodule.
  • Figure 4: Overview of approximation techniques for self-attention.
  • Figure 5: Approximate sparsity pattern of the top $20 \times 20$ block of the masked attention matrix $Z^{-1}A \in \mathbb{R}^{309 \times 309}$ corresponding to the $K$ and $Q$ matrices produced by the Llama 3.2 (1B) model with the HyperAttention (Han et al., 2023) abstract given as input, for different choices of heads and layers. The colors represent the magnitude of the entries, with lighter colors indicating larger entries; each row of the matrix has been normalized to have maximum element equal to $1$ (a minimal sketch of this construction is given after the figure list). It is interesting to note that some heads in specific layers exhibit so-called attention sinks (Xiao et al., 2024), visible as columns with consistently large weights, indicating tokens that attract attention across many query positions.
  • ...and 9 more figures
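
For readers who want to reproduce the kind of quantity shown in Figure 5, the following is a minimal NumPy sketch of the causally masked attention matrix $Z^{-1}A$ with the row-wise maximum normalization described in the caption. The Llama 3.2 (1B) weights and the actual input are not reproduced here; the randomly generated $Q$ and $K$ below are placeholders.

```python
import numpy as np

def masked_attention_matrix(Q, K):
    """Causally masked attention matrix Z^{-1} A, with each row
    rescaled so that its largest entry equals 1 (as in Figure 5)."""
    n, d = Q.shape
    scores = Q @ K.T / np.sqrt(d)
    mask = np.triu(np.ones((n, n), dtype=bool), k=1)   # entries above the diagonal
    scores = np.where(mask, -np.inf, scores)            # causal mask: no attention to future tokens
    A = np.exp(scores - scores.max(axis=1, keepdims=True))
    P = A / A.sum(axis=1, keepdims=True)                 # Z^{-1} A: rows sum to 1
    return P / P.max(axis=1, keepdims=True)              # rescale each row to have maximum 1

# Placeholder data in place of the model-generated Q and K matrices
rng = np.random.default_rng(1)
Q, K = rng.standard_normal((309, 64)), rng.standard_normal((309, 64))
M = masked_attention_matrix(Q, K)   # columns with consistently large entries suggest attention sinks
```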