Geometry and Dynamics of LayerNorm

Paul M. Riechers

TL;DR

LayerNorm is ubiquitous in deep networks, but its geometric action is not transparent. The author presents an explicit expression that decomposes LayerNorm into projection onto the hyperplane orthogonal to $\hat{1}$, scaling by $(\sigma^2+\epsilon)^{-1/2}$, a diagonal gain $G=\text{diag}(g)$, and a bias $b$, namely $\text{LayerNorm}(a,g,b,\epsilon) = \sqrt{N} \text{diag}(g) \frac{\Pi a}{\sqrt{|\Pi a|^2+N\epsilon}} + b$ with $\Pi = I - \hat{1}\hat{1}^T$. This clarifies that outputs lie in the intersection of an $(N-1)$-dimensional hyperplane and the interior of an $N$-dimensional hyperellipsoid, which is itself the interior of an $(N-1)$-dimensional hyperellipsoid. The paper identifies the subspace orthogonal to the LayerNorm image through the left nullspace of $\text{diag}(g) - g 1^T / N$, which for nonzero gains is spanned by the vector $\vec{\alpha}$ with components $\alpha_n = 1/g_n$. It characterizes the principal axes as the eigenstates of $\Pi_2 G^{-2} \Pi_2$, with semi-axes of length $\sqrt{N/\lambda}$ for the corresponding eigenvalue $\lambda$, and it discusses cases with zero gain via a Drazin inverse. These insights extend to Transformer architectures, where LayerNorm appears in several residual paths, offering a more intuitive, geometry-based understanding of its role in high-dimensional activation maps.
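
As a quick sanity check of the closed-form expression above, the following NumPy sketch compares it against the textbook definition (subtract the mean, divide by $\sqrt{\sigma^2+\epsilon}$, then apply gain and bias). The function names, the dimension, and the random inputs are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def layernorm_reference(a, g, b, eps):
    """Textbook LayerNorm: center, divide by sqrt(var + eps), then scale and shift."""
    mu = a.mean()
    var = a.var()  # population variance (ddof=0)
    return g * (a - mu) / np.sqrt(var + eps) + b

def layernorm_projection(a, g, b, eps):
    """Projection form: sqrt(N) diag(g) (Pi a) / sqrt(|Pi a|^2 + N eps) + b."""
    n = a.shape[0]
    one_hat = np.ones(n) / np.sqrt(n)               # unit vector along (1, ..., 1)
    proj = np.eye(n) - np.outer(one_hat, one_hat)   # Pi = I - 1hat 1hat^T
    pa = proj @ a
    return np.sqrt(n) * g * pa / np.sqrt(pa @ pa + n * eps) + b

# Hypothetical test inputs (assumption: any real vectors would do).
rng = np.random.default_rng(0)
n = 8
a = rng.normal(size=n)
g = rng.uniform(0.5, 2.0, size=n)
b = rng.normal(size=n)
eps = 1e-5

y_ref = layernorm_reference(a, g, b, eps)
y_proj = layernorm_projection(a, g, b, eps)
max_abs_diff = float(np.max(np.abs(y_ref - y_proj)))
print("max |difference| between the two forms:", max_abs_diff)
```

The two forms agree because $|\Pi a|^2 = N\sigma^2$, so $\sqrt{N}/\sqrt{|\Pi a|^2 + N\epsilon} = 1/\sqrt{\sigma^2+\epsilon}$.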

Abstract

A technical note aiming to offer deeper intuition for the LayerNorm function common in deep neural networks. LayerNorm is defined relative to a distinguished 'neural' basis, but it does more than just normalize the corresponding vector elements. Rather, it implements a composition -- of linear projection, nonlinear scaling, and then affine transformation -- on input activation vectors. We develop both a new mathematical expression and geometric intuition, to make the net effect more transparent. We emphasize that, when LayerNorm acts on an N-dimensional vector space, all outcomes of LayerNorm lie within the intersection of an (N-1)-dimensional hyperplane and the interior of an N-dimensional hyperellipsoid. This intersection is the interior of an (N-1)-dimensional hyperellipsoid, and typical inputs are mapped near its surface. We find the direction and length of the principal axes of this (N-1)-dimensional hyperellipsoid via the eigen-decomposition of a simply constructed matrix.
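
The geometric claims in the abstract can be probed numerically. The sketch below is a minimal illustration under stated assumptions: all gains are nonzero, inputs are standard-normal samples, and $\Pi_2$ is taken to be the projector $I - \hat{\alpha}\hat{\alpha}^T$ onto the plane orthogonal to $\hat{\alpha}$ (with $\alpha_n = 1/g_n$), which is our reading rather than a definition quoted from the paper. It checks that bias-subtracted outputs lie in that plane, that they stay inside the hyperellipsoid $|G^{-1}(y-b)|^2 \le N$, and that the tips of the computed semi-axes land on the hyperellipsoid surface.

```python
import numpy as np

rng = np.random.default_rng(1)
n, eps = 5, 1e-5
g = rng.uniform(0.5, 2.0, size=n)   # assumption: nonzero gains
b = rng.normal(size=n)
A = rng.normal(size=(1000, n))      # assumption: standard-normal activations

# Standard LayerNorm applied row by row.
centered = A - A.mean(axis=1, keepdims=True)
Y = g * centered / np.sqrt(A.var(axis=1, keepdims=True) + eps) + b

alpha = 1.0 / g
alpha_hat = alpha / np.linalg.norm(alpha)

# (1) Outputs minus bias are orthogonal to alpha_hat (they lie in a hyperplane).
plane_residual = float(np.abs((Y - b) @ alpha_hat).max())

# (2) Outputs minus bias satisfy |G^{-1}(y - b)|^2 <= N (inside the hyperellipsoid).
quad = (((Y - b) / g) ** 2).sum(axis=1)

# (3) Principal axes from the eigen-decomposition of Pi_2 G^{-2} Pi_2,
#     with Pi_2 assumed to be the projector orthogonal to alpha_hat.
Pi2 = np.eye(n) - np.outer(alpha_hat, alpha_hat)
M = Pi2 @ np.diag(g ** -2.0) @ Pi2
lam, vecs = np.linalg.eigh(M)        # ascending; lam[0] ~ 0 belongs to alpha_hat
semi_axes = np.sqrt(n / lam[1:])     # lengths sqrt(N / lambda)
tips = vecs[:, 1:] * semi_axes       # one semi-axis tip per column

tip_quad = ((tips / g[:, None]) ** 2).sum(axis=0)   # should equal N for each tip
tip_plane = float(np.abs(alpha_hat @ tips).max())   # should be ~0

print("max |alpha_hat . (y - b)|:", plane_residual)
print("max |G^-1 (y - b)|^2 / N:", float(quad.max()) / n)
print("semi-axis lengths:", semi_axes)
```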

Paper Structure (6 sections, 6 equations, 2 figures)

Figures (2)

  • Figure 1: LayerNorm as a composition of (i) Projection, (ii) Normalization, (iii) Linear transformation, and (iv) Global shift. Each point in the sequence of panels represents an activation vector input to LayerNorm. The (R,G,B) color values of each point directly encode the original activation of input neurons 1, 2, and 3, respectively, in this simple example with $N=3$. The second and third panels explicitly show the $\hat{1}$ vector, which is orthogonal to the plane of the initial projection, while the last two panels show the unit vector $\hat{\alpha} = \vec{\alpha} / \alpha$ (with neural-basis components of $\vec{\alpha}$ given by $\alpha_n = 1/g_n$), which is orthogonal to the new plane after the linear transformation by diag$(\vec{g})$. (The arrow depicting $\hat{\alpha}$ is shifted by $\vec{b}$ in the final panel.) The last two panels also show the principal semi-axes as dashed blue lines, which correspond to eigenstates of $\Pi_2 G^{-2} \Pi_2$, with length $\sqrt{N/\lambda}$. We have depicted an unusually large $\epsilon = 1/10$ to demonstrate that the interior of the hyperellipsoid is not strictly empty after LayerNorm; see Fig. 2 for smaller values of $\epsilon$.
  • Figure 2: From left to right, we compare the net effect of LayerNorm for $\epsilon = 10^{-1}$, $10^{-3}$, and $10^{-5}$. The final case corresponds to the default value for this small parameter in the standard PyTorch LayerNorm function. In practice, we should expect that most points get mapped very near the surface of the hyperellipsoid.
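
The effect of $\epsilon$ described in the Figure 2 caption can be reproduced in a rough way. The sketch below assumes $N=3$, standard-normal inputs, and hypothetical gain and bias values (none of these come from the paper). It reports the median of $|G^{-1}(y-b)|/\sqrt{N}$, which equals 1 exactly on the hyperellipsoid surface and is smaller in the interior.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 3
g = np.array([1.0, 0.5, 2.0])     # hypothetical gains
b = np.zeros(n)                   # hypothetical bias
A = rng.normal(size=(10000, n))   # assumption: standard-normal activations

def relative_radius(A, g, b, eps):
    """|G^{-1}(y - b)| / sqrt(N): 1 on the surface, below 1 in the interior."""
    centered = A - A.mean(axis=1, keepdims=True)
    Y = g * centered / np.sqrt(A.var(axis=1, keepdims=True) + eps) + b
    return np.linalg.norm((Y - b) / g, axis=1) / np.sqrt(A.shape[1])

medians = {
    eps: float(np.median(relative_radius(A, g, b, eps)))
    for eps in (1e-1, 1e-3, 1e-5)
}
for eps, m in medians.items():
    print(f"eps = {eps:g}: median relative radius = {m:.6f}")
```

Under these assumptions the median relative radius moves toward 1 as $\epsilon$ shrinks, consistent with most points being mapped very near the surface at the PyTorch default.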