Geometry of Lightning Self-Attention: Identifiability and Dimension

Nathan W. Henry, Giovanni Luca Marchetti, Kathlén Kohn

TL;DR

The paper studies the geometry of function spaces defined by lightning self-attention, a polynomial, unnormalized variant of attention. It uses algebraic geometry to characterize the fibers of the parametrization $W\mapsto\varphi_W$, which yields exact dimension calculations for the neuromanifold and exposes the parameter symmetries that govern identifiability and training dynamics. Key contributions are a complete description of the generic fibers for single-layer lightning self-attention, a dimension formula for the neuromanifold of deep architectures under bottleneck assumptions, and an analysis of singular and boundary points in the single-layer case. The authors also conjecture an extension to normalized (traditional) self-attention, prove it for a single layer, and verify it numerically in the deep case. The dimension results bear on sample complexity, the symmetries shape optimization, and together they prepare the ground for applying algebraic-geometric methods to attention-based models.

Abstract

We consider function spaces defined by self-attention networks without normalization, and theoretically analyze their geometry. Since these networks are polynomial, we rely on tools from algebraic geometry. In particular, we study the identifiability of deep attention by providing a description of the generic fibers of the parametrization for an arbitrary number of layers and, as a consequence, compute the dimension of the function space. Additionally, for a single-layer model, we characterize the singular and boundary points. Finally, we formulate a conjectural extension of our results to normalized self-attention networks, prove it for a single layer, and numerically verify it in the deep case.
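
The claim that unnormalized attention is polynomial can be checked numerically. The sketch below is illustrative only: the function name `lightning_attention`, the convention that tokens are the columns of `X`, and all sizes are assumptions not taken from this page. Under that convention the layer is $X \mapsto V X (X^\top K^\top Q X)$, which is homogeneous of degree 3 in $X$.

```python
import numpy as np

def lightning_attention(X, K, Q, V):
    """Unnormalized attention V X (X^T K^T Q X).

    Assumed convention: columns of X are tokens; no softmax is applied.
    """
    scores = X.T @ K.T @ Q @ X  # (t x t) attention scores, left unnormalized
    return V @ X @ scores

rng = np.random.default_rng(0)
d, a, t = 4, 2, 3  # hypothetical embedding dim, attention rank, sequence length
X = rng.standard_normal((d, t))
K = rng.standard_normal((a, d))
Q = rng.standard_normal((a, d))
V = rng.standard_normal((d, d))

Y = lightning_attention(X, K, Q, V)
Y_scaled = lightning_attention(2.0 * X, K, Q, V)

# Each output entry is a cubic form in the entries of X,
# so scaling the input by 2 scales the output by 2**3.
is_cubic = np.allclose(Y_scaled, 8.0 * Y)
```

Note that in this form the map depends on `K` and `Q` only through the product `K.T @ Q`.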

Paper Structure (18 sections, 15 theorems, 36 equations, 4 figures)

Key Result

Lemma 3.1

Suppose that $A = K^\top Q = K'^\top Q' = A'$ and that $\textnormal{rk}(A) = \textnormal{rk}(A') = a \leq d$. Then there exists a unique invertible matrix $C \in \textnormal{GL}_a(\mathbb{R})$ such that $K' = CK$ and $Q' = C^{-\top}Q$.
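
The lemma can be sanity-checked numerically. The following is a minimal sketch, not the authors' code: it assumes $K, Q \in \mathbb{R}^{a \times d}$, and the sizes, the seed, and the particular matrix `C` are illustrative choices. It builds a second factorization of the same $A$ and then recovers $C$ from $K$ and $K'$ alone.

```python
import numpy as np

rng = np.random.default_rng(0)
a, d = 2, 5  # hypothetical sizes with a <= d

# A generic factorization A = K^T Q of rank a.
K = rng.standard_normal((a, d))
Q = rng.standard_normal((a, d))
A = K.T @ Q
rank_A = np.linalg.matrix_rank(A)

# An explicit invertible C (determinant 5.5), chosen for illustration.
C = np.array([[2.0, 1.0],
              [0.5, 3.0]])

# The reparametrization from the lemma: K' = C K, Q' = C^{-T} Q.
K2 = C @ K
Q2 = np.linalg.inv(C).T @ Q
A2 = K2.T @ Q2
same_attention = np.allclose(A, A2)

# Uniqueness: K has full row rank a, so K @ pinv(K) = I_a and
# C is recovered as K' @ pinv(K).
C_hat = K2 @ np.linalg.pinv(K)
```

The pairs $(K, Q)$ that yield a fixed rank-$a$ matrix $A$ therefore form a copy of $\textnormal{GL}_a(\mathbb{R})$, which has dimension $a^2$. This is consistent with the standard count $2ad - a^2$ for the dimension of the set of $d \times d$ matrices of rank at most $a$.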

Figures (4)

  • Figure 1: A slice of the space of lightning self-attention mechanisms.
  • Figure 2: Diagrammatic illustration of Equation \ref{eq:triadic}.
  • Figure 3: Plot of the estimated and expected dimensions of the neuromanifold as $\delta$ varies.
  • Figure 4: Diagrammatic illustration of the symmetry involved in the cancellation argument.

Theorems & Definitions (41)

  • Definition 1
  • Definition 2
  • Definition 3
  • Lemma 3.1
  • proof
  • Theorem 3.2
  • proof
  • Corollary 3.3
  • proof
  • Theorem 3.4
  • ...and 31 more