On the Anatomy of Attention

Nikhil Khatri, Tuomas Laakkonen, Jonathon Liu, Vincent Wang-Maścianica

TL;DR

On the Anatomy of Attention develops a category-theoretic, diagrammatic framework for relating and reasoning about deep learning architectures, with a focus on attention. It introduces SIMD-decorated string diagrams and a rewrite system based on universal approximation, and uses them to turn the folklore account of how attention evolved (from Bahdanau to Vaswani) into a derivation, to recover the linearised attention variant, and to build a taxonomy of attention variants. Empirically, the authors show that 14 distinct attention mechanisms perform broadly comparably on a language modelling task, suggesting that the exact attention structure may not be the sole determinant of Transformer-like performance. The framework offers a principled way to analyse and generate architectural variants and could guide future explorations beyond conventional attention designs.

Abstract

We introduce a category-theoretic diagrammatic formalism in order to systematically relate and reason about machine learning models. Our diagrams present architectures intuitively but without loss of essential detail, where natural relationships between models are captured by graphical transformations, and important differences and similarities can be identified at a glance. In this paper, we focus on attention mechanisms: translating folklore into mathematical derivations, and constructing a taxonomy of attention variants in the literature. As a first example of an empirical investigation underpinned by our formalism, we identify recurring anatomical components of attention, which we exhaustively recombine to explore a space of variations on the attention mechanism.

Paper Structure

This paper contains 28 sections, 13 theorems, 53 equations, and 4 figures.

Key Result

Proposition A.11

Figures (4)

  • Figure 1: The taxonomy generated by starting with a 'primordial attention' mechanism and applying specialisations and expressivity rewrites. Details of the notation and rewrites are given in Section \ref{sec:tax}.
  • Figure 2: (a) The results of training Transformer models based on the 14 attention variants identified above. They were trained ab initio on word-level language modelling of the Penn Treebank corpus; all models have four attention heads per layer and an embedding dimension of 512. We used the learning-rate scheduler given by Vaswani et al. (2017), with the initial learning rate tuned per model. Each line on the plot shows the test-set perplexity (PPL) of one model with between one and five layers, plotted against total trainable parameter count. The models are coloured according to whether they contain only linear attention generators (magenta), only dot-product attention generators (cyan), or both (orange). (b) The same results, shown alongside those from Radford et al. (2019) and Brown et al. (2020) for scale. Note that PPL is on an exponential scale, so a given difference matters less as the PPL value increases.
  • Figure 3: The attention mechanisms generated by exhaustively recombining the given generators, after removing redundant models using the criteria described above. Note that M0 corresponds precisely to the linear attention mechanism presented in Katharopoulos et al. (2020), and M1 corresponds to scaled dot-product attention as presented in Vaswani et al. (2017).
  • Figure 4: Relative running times and GPU memory required for each of the 14 proposed attention mechanisms, for models of between one and five layers, using the hyperparameters given in Appendix \ref{app:experiment}. The data is scaled so that the performance of M1 (scaled dot-product attention as in Vaswani et al., 2017) is fixed to one (independently for each number of layers), in order to be somewhat platform-independent. All experiments were performed on NVIDIA A30 GPUs. The reported error bars are one standard deviation.
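
For readers who want a concrete picture of the two mechanisms that the Figure 3 caption names, the following is a minimal, dependency-free sketch of single-head scaled dot-product attention and of linear attention using the feature map elu(x) + 1. It is an illustration under stated assumptions, not the paper's diagrammatic definitions of M1 and M0: the function names are hypothetical, there is no learned projection, no multi-head splitting, and no causal mask (which a language model would need).

```python
import math


def dot(a, b):
    """Inner product of two equal-length lists of floats."""
    return sum(x * y for x, y in zip(a, b))


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(x - peak) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(queries, keys, values):
    """softmax(Q K^T / sqrt(d_k)) V for one head, without masking."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    d_v = len(values[0])
    outputs = []
    for q in queries:
        # One weight per key; the weights for each query sum to 1.
        weights = softmax([dot(q, k) * scale for k in keys])
        outputs.append(
            [sum(w * v[j] for w, v in zip(weights, values)) for j in range(d_v)]
        )
    return outputs


def feature_map(x):
    """phi(x) = elu(x) + 1, elementwise; the result is strictly positive."""
    return [xi + 1.0 if xi > 0 else math.exp(xi) for xi in x]


def linear_attention(queries, keys, values):
    """phi(Q) (phi(K)^T V) with per-query normalisation, without masking."""
    d_k, d_v = len(keys[0]), len(values[0])
    phi_keys = [feature_map(k) for k in keys]
    # These two summaries over the keys are computed once and reused for
    # every query, so the cost grows linearly with sequence length.
    kv = [
        [sum(pk[i] * v[j] for pk, v in zip(phi_keys, values)) for j in range(d_v)]
        for i in range(d_k)
    ]
    k_sum = [sum(pk[i] for pk in phi_keys) for i in range(d_k)]
    outputs = []
    for q in queries:
        phi_q = feature_map(q)
        norm = dot(phi_q, k_sum)
        outputs.append(
            [dot(phi_q, [kv[i][j] for i in range(d_k)]) / norm for j in range(d_v)]
        )
    return outputs


# Toy inputs (made up for illustration): 2 queries, 3 keys, 3 values.
queries = [[1.0, 0.0], [0.0, 1.0]]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
values = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

softmax_out = scaled_dot_product_attention(queries, keys, values)
linear_out = linear_attention(queries, keys, values)
```

In both functions each output row is a weighted average of the value rows with positive weights; the two differ in how the weights are obtained. The softmax version forms a score for every query-key pair, whereas the linear version factors the weights through the feature map so that the key-value summary can be shared across queries.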

Theorems & Definitions (50)

  • Example 2.1: The vanilla transformer
  • Example 2.2: The original self-attention
  • Definition 3.1: Expressive Reductions
  • Definition A.1: PROP
  • Example A.2
  • Definition A.3: Coloured PROP
  • Definition A.4: Cartesian PROP
  • Definition A.6: Reshaping
  • Example A.7
  • Definition A.8: $\mathbb{F}(\mathbb{R}^\otimes)$
  • ...and 40 more