Deriving Transformer Architectures as Implicit Multinomial Regression

Jonas A. Actor, Anthony Gruber, Eric C. Cyr

TL;DR

This work establishes a theoretical link between attention mechanisms and multinomial regression by analyzing the gradient-flow dynamics of feature representations $Z$ under a cross-entropy loss $L(Z,\theta)$. For a linear model $N(Z,\theta)=Z\theta^\top$, the continuous-time dynamics $\dot{Z} = C\theta - \mathrm{CA}(Z,\theta)$, with $\mathrm{CA}(Z,\theta)=\sigma_i(Z\theta^\top)\theta$, implement feature discovery and show how cross-attention emerges from optimization. For the quadratic form $N(Z,\theta)=Z\theta Z^\top$ with symmetric $\theta=\phi\phi^\top$, the gradient involves self-attention terms $\mathrm{SA}$, and operator splitting yields discrete transformer-like updates. A proof-of-principle experiment on Fashion-MNIST shows that a few iterative attention updates can markedly improve classification on both clean and noisy inputs, illustrating the practical relevance of the gradient-based interpretation.
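The linear-model dynamics can be illustrated with a small forward-Euler sketch. This is not the authors' implementation: it assumes that $C$ is a one-hot label matrix of shape $n \times K$, that $\theta$ has shape $K \times d$, and that $\sigma_i$ is a row-wise softmax. The step size, the number of steps, and the toy data are hypothetical choices made only for illustration.

```python
import math

def softmax_rows(S):
    """Row-wise softmax of a list-of-lists matrix (max-shifted for stability)."""
    out = []
    for row in S:
        m = max(row)
        e = [math.exp(s - m) for s in row]
        t = sum(e)
        out.append([v / t for v in e])
    return out

def matmul(A, B):
    """Plain matrix product A @ B for list-of-lists matrices."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def cross_entropy(Z, theta, C):
    """Cross-entropy of softmax(Z theta^T) against labels C (assumed one-hot)."""
    P = softmax_rows(matmul(Z, transpose(theta)))
    return -sum(c * math.log(p)
                for c_row, p_row in zip(C, P)
                for c, p in zip(c_row, p_row) if c > 0)

def flow_rhs(Z, theta, C):
    """Right-hand side C theta - CA(Z, theta), with CA = softmax(Z theta^T) theta."""
    CA = matmul(softmax_rows(matmul(Z, transpose(theta))), theta)
    C_theta = matmul(C, theta)
    return [[a - b for a, b in zip(r1, r2)] for r1, r2 in zip(C_theta, CA)]

def euler_flow(Z, theta, C, step=0.1, n_steps=20):
    """Forward-Euler discretization of dZ/dt = C theta - CA(Z, theta).

    The discretization and step size are assumptions, not taken from the paper.
    """
    losses = [cross_entropy(Z, theta, C)]
    for _ in range(n_steps):
        R = flow_rhs(Z, theta, C)
        Z = [[z + step * r for z, r in zip(z_row, r_row)]
             for z_row, r_row in zip(Z, R)]
        losses.append(cross_entropy(Z, theta, C))
    return Z, losses

# Hypothetical toy problem: K = 3 classes, d = 2 features, n = 2 samples.
theta = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
C = [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
Z0 = [[0.2, -0.1], [0.5, 0.3]]
Z_final, losses = euler_flow(Z0, theta, C)
```

Under these assumptions the right-hand side is the negative gradient of the cross-entropy with respect to $Z$, so each small step moves the features toward lower loss while $\theta$ stays fixed.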

Abstract

While attention has been empirically shown to improve model performance, it lacks a rigorous mathematical justification. This short paper establishes a novel connection between attention mechanisms and multinomial regression. Specifically, we show that in a fixed multinomial regression setting, optimizing over latent features yields solutions that align with the dynamics induced on features by attention blocks. In other words, the evolution of representations through a transformer can be interpreted as a trajectory that recovers the optimal features for classification.

Paper Structure

This paper contains 6 sections, 5 theorems, 10 equations, 1 figure, 1 table.

Key Result

Theorem 1

The $Z$-derivative of log-sum-exp applied to the linear model $Z\theta^\intercal$ satisfies $\partial_Z \mathrm{LSE}_i(Z\theta^\intercal) = \sigma_i(Z\theta^\intercal)^\intercal \mathbin{\bar{\otimes}} \theta$.
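A finite-difference check can make the statement concrete for a single row $z$ of $Z$. The sketch below assumes that, for one row, the product $\mathbin{\bar{\otimes}}$ reduces to the softmax-weighted sum of the rows of $\theta$, that is $\sum_k \sigma_k(\theta z)\,\theta_k$; the operator is not defined in this listing, so that reading is an assumption. The values of `theta` and `z` are hypothetical.

```python
import math

def lse(scores):
    """Numerically stable log-sum-exp of a list of scores."""
    m = max(scores)
    return m + math.log(sum(math.exp(s - m) for s in scores))

def softmax(scores):
    m = max(scores)
    e = [math.exp(s - m) for s in scores]
    t = sum(e)
    return [v / t for v in e]

def class_scores(z, theta):
    """Scores theta z for one feature row z (theta has one row per class)."""
    return [sum(t * x for t, x in zip(row, z)) for row in theta]

def analytic_grad(z, theta):
    """Softmax-weighted sum of the rows of theta: sum_k softmax_k * theta_k."""
    p = softmax(class_scores(z, theta))
    return [sum(p[k] * theta[k][j] for k in range(len(theta)))
            for j in range(len(z))]

def numeric_grad(z, theta, h=1e-6):
    """Central finite differences of LSE(theta z) with respect to z."""
    grad = []
    for j in range(len(z)):
        z_plus, z_minus = list(z), list(z)
        z_plus[j] += h
        z_minus[j] -= h
        diff = lse(class_scores(z_plus, theta)) - lse(class_scores(z_minus, theta))
        grad.append(diff / (2 * h))
    return grad

# Hypothetical values: K = 3 classes, d = 2 features.
theta = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
z = [0.2, -0.1]
max_error = max(abs(a - b)
                for a, b in zip(analytic_grad(z, theta), numeric_grad(z, theta)))
```

If the row-wise reading is right, `max_error` should be at the level of finite-difference noise, which ties the theorem to the cross-attention term $\sigma_i(Z\theta^\top)\theta$ in the TL;DR.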

Figures (1)

  • Figure 1: Evolution of a sample $Z^{(\ell)}$ when iterating through a transformer block. The first row shows the original image $X$, the initial (noisy) value $Z^{(0)}$, and the subsequent iterates $Z^{(\ell)}$. The second row tracks the difference between $Z^{(0)}$ and $Z^{(\ell)}$, with substantial lightening around the collar and darkening above the sleeves in each iterate.

Theorems & Definitions (5)

  • Theorem 1
  • Lemma 1
  • Theorem 2
  • Corollary 1
  • Corollary 2