Self-Attention as a Parametric Endofunctor: A Categorical Framework for Transformer Architectures

Charles O'Neill

TL;DR

The paper develops a category-theoretic framework for self-attention, showing the linear components form a parametric endomorphism in $\mathsf{Para}(\mathsf{Vect})$ and that stacking corresponds to the free monad $\mathrm{Free}(F)$ on the induced endofunctor $F$. Positional encodings are recast as (affine) monoid actions when additive, while sinusoidal schemes provide faithful, injective labelings with a universal property among faithful encodings. The linear parts of self-attention are shown to be equivariant under token permutations, and mechanistic interpretability circuits align with compositions of parametric 1-morphisms. The work unifies geometric, algebraic, and interpretability perspectives while clarifying how nonlinearities (softmax, layernorm) lie beyond the current linear $\mathsf{Vect}$ setting, inviting extensions to richer categorical contexts. Overall, this framework offers a principled, universal lens on transformer architecture, guiding principled design and interpretability analysis while pointing to future directions that incorporate nonlinear and variable-length aspects.

Abstract

Self-attention mechanisms have revolutionised deep learning architectures, yet their core mathematical structures remain incompletely understood. In this work, we develop a category-theoretic framework focusing on the linear components of self-attention. Specifically, we show that the query, key, and value maps naturally define a parametric 1-morphism in the 2-category $\mathbf{Para(Vect)}$. On the underlying 1-category $\mathbf{Vect}$, these maps induce an endofunctor whose iterated composition precisely models multi-layer attention. We further prove that stacking multiple self-attention layers corresponds to constructing the free monad on this endofunctor. For positional encodings, we demonstrate that strictly additive embeddings correspond to monoid actions in an affine sense, while standard sinusoidal encodings, though not additive, retain a universal property among injective (faithful) position-preserving maps. We also establish that the linear portions of self-attention exhibit natural equivariance to permutations of input tokens, and show how the "circuits" identified in mechanistic interpretability can be interpreted as compositions of parametric 1-morphisms. This categorical perspective unifies geometric, algebraic, and interpretability-based approaches to transformer analysis, making explicit the underlying structures of attention. We restrict to linear maps throughout, deferring the treatment of nonlinearities such as softmax and layer normalisation, which require more advanced categorical constructions. Our results build on and extend recent work on category-theoretic foundations for deep learning, offering deeper insights into the algebraic structure of attention mechanisms.
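For orientation, the stacking claim can be written out symbolically. The block below is the standard textbook description of a free monad on an endofunctor, not the paper's own statement of its stacking theorem. It assumes the relevant initial algebras exist in $\mathsf{Vect}$, and the last line further assumes that $F$ preserves direct sums and sequential colimits.

```latex
% n stacked layers: the n-fold composite of the endofunctor F
F^{n} \;=\; \underbrace{F \circ F \circ \cdots \circ F}_{n \text{ times}},
\qquad F^{0} = \mathrm{Id}_{\mathsf{Vect}}

% Free monad as a least fixed point (assumed to exist)
\mathrm{Free}(F)(X) \;\cong\; \mu Y.\; X \oplus F(Y)

% Under the extra colimit-preservation assumption this unrolls to
\mathrm{Free}(F)(X) \;\cong\; \bigoplus_{n \ge 0} F^{n}(X)
```

Read this way, the summand $F^{n}(X)$ corresponds to a stack of exactly $n$ layers, and the free monad collects all finite depths at once.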

Paper Structure (47 sections, 6 theorems, 65 equations, 2 figures)

Key Result

Theorem 3.1

Let $Q,K,V \colon \mathbb{R}^d \to \mathbb{R}^{d_k}$ (or $\mathbb{R}^{d_v}$ for the value projection) be three linear maps representing the query, key, and value transformations of a single-head self-attention mechanism. For a length-$n$ sequence, the input space is $X = (\mathbb{R}^d)^n$ [...]. Furthermore, this 1-morphism $(\mathsf{AttP},\mathsf{att})$ is stable under composition in $\mathsf{Para}(\mathsf{Vect})$ [...]
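In the setting of this theorem, $Q$, $K$, and $V$ act token by token on $X = (\mathbb{R}^d)^n$, which is one way to see the permutation-equivariance claim made in the abstract. The following is a minimal sketch of that check, not the paper's construction: the toy matrices, the dimensions ($n = 3$, $d = d_k = d_v = 2$), and the helper names `apply_tokenwise`, `permute`, and `linear_part` are all assumptions made for illustration.

```python
from itertools import permutations

def apply_tokenwise(W, X):
    """Apply the matrix W (rows = output coordinates) to every token of X."""
    return [[sum(w * x for w, x in zip(row, tok)) for row in W] for tok in X]

def permute(X, perm):
    """Reorder tokens: output position i holds input token perm[i]."""
    return [X[j] for j in perm]

# Hypothetical toy data; integers keep the equality checks exact.
X = [[1, 2], [3, 4], [5, 6]]   # n = 3 tokens in R^2
Q = [[1, 0], [2, -1]]
K = [[0, 1], [1, 1]]
V = [[3, 1], [-1, 2]]

def linear_part(X):
    """Linear portion only: the (query, key, value) projections of each token."""
    return tuple(apply_tokenwise(W, X) for W in (Q, K, V))

# Equivariance: permuting tokens before projecting gives the same result
# as projecting first and then permuting, for every permutation of 3 tokens.
equivariant = all(
    linear_part(permute(X, p)) == tuple(permute(Y, p) for Y in linear_part(X))
    for p in permutations(range(len(X)))
)
print(equivariant)
```

The check passes because each projection touches one token at a time, so reordering tokens commutes with it. Softmax and the attention-weighted sum are deliberately left out, matching the restriction to linear maps.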

Figures (2)

  • Figure 1: Conceptual overview of how the monoid $M$ (left) provides positional encodings $\bigl(p:\mathbf{BM}\to \mathsf{Vect}\bigr)$ that are added to the embedding space $\mathbf{X}^n$. The main self-attention block is formalised as a parametric endofunctor with learnable queries $(Q)$, keys $(K)$, and values $(V)$. Repeated application of this block yields the stacked self-attention layers, interpreted categorically as a free monad on $F$.
  • Figure 2: A string diagram illustrating two parametric morphisms (QK and OV) composed sequentially in $\mathsf{Para}(\mathsf{Vect})$. Each morphism has a horizontal wire (data flow) and a vertical wire (parameter space). The output of the first morphism (QK) feeds into the second (OV).
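The sequential composition pictured in Figure 2 can be sketched in code. This is a hedged illustration rather than the paper's definition: a parametric morphism is modelled as a function of a parameter and an input, and the composite takes a pair of parameters, which corresponds to a pure tensor in the parameter object. The names `para_compose` and `apply_matrix` and the matrices `W_qk` and `W_ov` are hypothetical stand-ins, not the actual QK and OV circuits.

```python
def matvec(W, x):
    """Multiply the matrix W (given as a list of rows) by the vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def para_compose(f, g):
    """Sequential composite of two parametric maps.

    f : (p, x) -> y and g : (q, y) -> z yield a map ((q, p), x) -> z
    whose parameter is the pair of the two original parameters.
    """
    def composite(params, x):
        q, p = params
        return g(q, f(p, x))
    return composite

def apply_matrix(W, x):
    """A parametric linear map: the parameter is the matrix itself."""
    return matvec(W, x)

# First morphism feeds the second, as in the string diagram.
qk_then_ov = para_compose(apply_matrix, apply_matrix)

W_qk = [[1, 2], [0, 1]]   # hypothetical parameter of the first morphism
W_ov = [[2, 0], [1, 1]]   # hypothetical parameter of the second morphism
x = [1, 1]

out = qk_then_ov((W_ov, W_qk), x)
print(out)  # [6, 4]
```

Keeping the parameters as an explicit pair mirrors the two vertical parameter wires in the diagram, while the nested call mirrors the single horizontal data wire.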

Theorems & Definitions (14)

  • Definition 2.1: Category
  • Definition 2.2: Endofunctor
  • Definition 2.3: $\mathsf{Para}(\mathsf{Vect})$ with Tensor Products
  • Theorem 3.1: Self-Attention as a Parametric 1-Morphism on $\mathsf{Vect}$
  • Theorem 3.2: Stacking as a Free Monad
  • Remark 4.1: Affine vs. Linear
  • Theorem 5.1
  • proof
  • Theorem A.1: Self-Attention as a Parametric 1-Morphism on $\mathsf{Vect}$
  • proof
  • ...and 4 more