On the Anatomy of Attention

Nikhil Khatri; Tuomas Laakkonen; Jonathon Liu; Vincent Wang-Maścianica

TL;DR

On the Anatomy of Attention develops a category-theoretic, diagrammatic framework for relating and reasoning about deep learning architectures, with a focus on attention. It introduces SIMD-decorated string diagrams and a rewrite system based on universal approximation, uses them to derive the folklore evolution from Bahdanau et al. to Vaswani et al. and from Vaswani et al. to linearised attention, and organises attention variants from the literature into a taxonomy. Empirically, the authors train Transformer models built on 14 distinct attention mechanisms on word-level language modelling of the Penn Treebank corpus and find broadly comparable performance, suggesting that the exact attention structure may not be the sole determinant of Transformer-like performance. The framework offers a principled way to analyse and generate architectural variants and could guide exploration beyond conventional attention designs.

Abstract

We introduce a category-theoretic diagrammatic formalism in order to systematically relate and reason about machine learning models. Our diagrams present architectures intuitively but without loss of essential detail, where natural relationships between models are captured by graphical transformations, and important differences and similarities can be identified at a glance. In this paper, we focus on attention mechanisms: translating folklore into mathematical derivations, and constructing a taxonomy of attention variants in the literature. As a first example of an empirical investigation underpinned by our formalism, we identify recurring anatomical components of attention, which we exhaustively recombine to explore a space of variations on the attention mechanism.

Paper Structure (28 sections, 13 theorems, 53 equations, 4 figures)

Introduction
Formally depicting architectures
The evolution of architectures via rewrites
Bahdanau et al. to Vaswani et al.
Vaswani et al. to linearised attention
Taxonomizing architectures
Empirically exploring a space of attention mechanisms
Discussion
Formal semantics
Tick-notation for arbitrary dimensions
Diagrammatic rewriting: universal approximation
Confluence
On tensor-notation in machine learning and related work
SIMD notation
Free tensoring of a PROP
...and 13 more sections

Key Result

Proposition A.11

Figures (4)

Figure 1: The taxonomy generated by starting with a 'primordial attention' mechanism and applying specialisations and expressivity rewrites. Details of the notation and rewrites are given in Section \ref{sec:tax}.
Figure 2: (a) The results of training Transformer models based on the 14 attention variants identified above. They were trained ab initio on word-level language modelling of the Penn Treebank corpus; all models have four attention heads per layer and an embedding dimension of 512. We used the learning-rate scheduler given by vaswani2017attention, with the initial learning rate tuned per model. Each line on the plot shows the test-set perplexity of one model for between one and five layers, plotted against total trainable parameter count. The models are coloured according to whether they contain only linear attention generators (magenta), only dot-product attention generators (cyan), or both (orange). (b) The same, with the results from radford2019language and brown2020language included for scale. Note that PPL is an exponential scale, so differences matter less as the PPL value increases.
Figure 3: The attention mechanisms generated by exhaustively recombining the given generators, after removing redundant models using the criteria described above. Note that M0 corresponds precisely to the linear attention mechanism presented in katharopoulos2020transformers, and M1 corresponds to scaled dot-product attention as presented in vaswani2017attention.
Figure 4: Relative running times and GPU memory required for each of the 14 proposed attention mechanisms, for models of between one and five layers, using the hyperparameters given in Appendix \ref{app:experiment}. The data is scaled so that the performance of M1 (scaled dot-product attention as in vaswani2017attention) is fixed to one (independently for each number of layers), in order to be somewhat platform-independent. All experiments were performed on NVIDIA A30 GPUs. The reported error bars are one standard deviation.
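
The Figure 3 caption names two reference mechanisms: M0, linear attention (katharopoulos2020transformers), and M1, scaled dot-product attention (vaswani2017attention). The following is a minimal Python sketch of both, given only as orientation and not as the authors' implementation. It assumes a single head, no masking, no learned projections, and the elu(x) + 1 feature map for the linear variant. The helper names (`matmul`, `transpose`, `softmax`, `elu_plus_one`) and the toy inputs are illustrative assumptions, and the code uses plain lists so that it runs without third-party libraries.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    """Transpose a matrix given as a list of rows."""
    return [list(col) for col in zip(*a)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    m = max(row)
    exps = [math.exp(x - m) for x in row]
    total = sum(exps)
    return [e / total for e in exps]


def scaled_dot_product_attention(q, k, v):
    """Compute softmax(Q K^T / sqrt(d_k)) V for a single head."""
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v)


def elu_plus_one(x):
    """Feature map phi(x) = elu(x) + 1, which is strictly positive."""
    return x + 1.0 if x > 0 else math.exp(x)


def linear_attention(q, k, v):
    """Non-causal linear attention: phi(Q) (phi(K)^T V), normalised per query."""
    phi_q = [[elu_plus_one(x) for x in row] for row in q]
    phi_k = [[elu_plus_one(x) for x in row] for row in k]
    kv = matmul(transpose(phi_k), v)           # d_k x d_v summary shared by all queries
    k_sum = [sum(col) for col in zip(*phi_k)]  # length-d_k normaliser
    numer = matmul(phi_q, kv)
    denom = [sum(a * b for a, b in zip(row, k_sum)) for row in phi_q]
    return [[x / z for x in row] for row, z in zip(numer, denom)]


# Toy inputs (hypothetical): 2 queries, 3 keys/values, d_k = d_v = 2.
q = [[1.0, 0.0], [0.0, 1.0]]
k = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
v = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]

out_softmax = scaled_dot_product_attention(q, k, v)
out_linear = linear_attention(q, k, v)
```

In both functions each output row is a weighted average of the rows of `v`; the difference is where the normalisation happens. The linear variant computes the key-value summary `kv` once and reuses it for every query, which is why its cost grows linearly rather than quadratically with sequence length.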

Theorems & Definitions (50)

Example 2.1: The vanilla transformer
Example 2.2: The original self-attention
Definition 3.1: Expressive Reductions
Definition A.1: PROP
Example A.2
Definition A.3: Coloured PROP
Definition A.4: Cartesian PROP
Definition A.6: Reshaping
Example A.7
Definition A.8: $\mathbb{F}(\mathbb{R}^\otimes)$
...and 40 more
