Self-Attention as Distributional Projection: A Unified Interpretation of Transformer Architecture

Nihal Mehta

TL;DR

The paper explains why self-attention works by recasting it as a projection of corpus-level co-occurrence statistics into sequence context. It starts from a symmetric co-occurrence operator $S$ and refines it step by step until it reaches the Transformer attention mechanism. It introduces the core projection $M=QSQ^{\top}$, extends it to vocabulary-level predictions with $E=M(QS)$, and then incorporates positional structure and directional asymmetry to connect to the full Transformer block, $\mathrm{softmax}\left(\frac{HW_QW_K^{\top}H^{\top}}{\sqrt{d_k}}\right)HW_V$. The key contributions are a uniqueness result for the projection, a principled derivation of the positional and directional components from the same projection principle, and a unified interpretation of multi-head and cross-attention as structured refinements of distributional semantics. The framework bridges classical distributional semantics and modern Transformer architectures, giving a mathematical account of why attention succeeds in language modeling and how its components arise from projections onto context-relevant subspaces.
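The two projection formulas above can be illustrated with a small NumPy sketch. This is only a toy illustration under stated assumptions, not the paper's implementation: it assumes $Q$ is a one-hot token-selection matrix of shape $R\times V$ (sequence length by vocabulary size), which this summary does not state, and it uses a random symmetric matrix as a stand-in for real co-occurrence counts. The helper names `one_hot_selector` and `project` are hypothetical.

```python
import numpy as np

def one_hot_selector(token_ids, vocab_size):
    """Assumed form of Q: row i is the one-hot vector of the i-th token."""
    Q = np.zeros((len(token_ids), vocab_size))
    Q[np.arange(len(token_ids)), token_ids] = 1.0
    return Q

def project(S, Q):
    """Return M = Q S Q^T (R x R) and E = M (Q S) (R x V)."""
    M = Q @ S @ Q.T      # co-occurrence restricted to the tokens in the sequence
    E = M @ (Q @ S)      # vocabulary-level scores for each sequence position
    return M, E

# Toy data: a random symmetric matrix standing in for corpus statistics.
rng = np.random.default_rng(0)
V = 6
A = rng.random((V, V))
S = A + A.T

token_ids = [2, 0, 5, 2]          # a sequence of R = 4 token ids (illustrative)
Q = one_hot_selector(token_ids, V)
M, E = project(S, Q)
print(M.shape, E.shape)
```

Under the one-hot assumption, entry $M_{ij}$ is simply the corpus-level co-occurrence $S$ evaluated at the $i$-th and $j$-th tokens of the sequence, so $M$ inherits the symmetry of $S$.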

Abstract

This paper presents a mathematical interpretation of self-attention by connecting it to distributional semantics principles. We show that self-attention emerges from projecting corpus-level co-occurrence statistics into sequence context. Starting from the co-occurrence matrix underlying GloVe embeddings, we demonstrate how the projection naturally captures contextual influence, with the query-key-value mechanism arising as the natural asymmetric extension for modeling directional relationships. Positional encodings and multi-head attention then follow as structured refinements of this same projection principle. Our analysis demonstrates that the Transformer architecture's particular algebraic form follows from these projection principles rather than being an arbitrary design choice.
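The role of asymmetry in the query-key-value mechanism can be seen in a minimal NumPy sketch of the standard scaled dot-product form $\mathrm{softmax}\left(\frac{HW_QW_K^{\top}H^{\top}}{\sqrt{d_k}}\right)HW_V$. The shapes and random weights below are illustrative assumptions, and the function names are hypothetical; the sketch only shows that tying $W_Q=W_K$ yields a symmetric score matrix, while separate query and key maps allow directional scores.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def attention_scores(H, W_Q, W_K):
    """Pre-softmax scores H W_Q W_K^T H^T / sqrt(d_k)."""
    d_k = W_K.shape[1]
    return (H @ W_Q) @ (H @ W_K).T / np.sqrt(d_k)

def self_attention(H, W_Q, W_K, W_V):
    """Row-wise softmax of the scores, applied to the values H W_V."""
    weights = softmax(attention_scores(H, W_Q, W_K), axis=-1)
    return weights @ (H @ W_V)

# Illustrative shapes: R = 4 positions, model width 8, d_k = 3, d_v = 5.
rng = np.random.default_rng(0)
H = rng.standard_normal((4, 8))
W_Q = rng.standard_normal((8, 3))
W_K = rng.standard_normal((8, 3))
W_V = rng.standard_normal((8, 5))

tied = attention_scores(H, W_Q, W_Q)      # W_Q = W_K: symmetric scores
untied = attention_scores(H, W_Q, W_K)    # separate maps: directional scores
out = self_attention(H, W_Q, W_K, W_V)
print(np.allclose(tied, tied.T), np.allclose(untied, untied.T), out.shape)
```

Note that the softmax is applied row by row, so even the tied case generally produces non-symmetric attention weights after normalization; the symmetry discussed here concerns the pre-softmax scores.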

Paper Structure

This paper contains 36 sections, 1 theorem, 43 equations.

Key Result

Proposition 1

Let $T(S)\in\mathbb{R}^{R\times R}$ be a linear operator that satisfies: Then $T(S)=QSQ^{\top}$ is the unique operator satisfying both conditions.

Theorems & Definitions (2)

  • Proposition 1: Uniqueness of the Projection
  • proof