Geometry and Dynamics of LayerNorm
Paul M. Riechers
TL;DR
LayerNorm is ubiquitous in deep networks, but its geometric action is not transparent. The authors present an explicit expression that decomposes LayerNorm into projection onto the hyperplane orthogonal to $\hat{1}$ (the unit vector along the all-ones direction), normalization by $(\sigma^2+\epsilon)^{-1/2}$, a diagonal gain $G=\text{diag}(g)$, and a bias $b$, namely $\text{LayerNorm}(a,g,b,\epsilon) = \sqrt{N} \text{diag}(g) \frac{\Pi a}{\sqrt{|\Pi a|^2+N\epsilon}} + b$ with $\Pi = I - \hat{1}\hat{1}^T$. This clarifies that outputs lie in the intersection of an $(N-1)$-dimensional hyperplane and the interior of an $N$-dimensional hyperellipsoid, which is itself the interior of an $(N-1)$-dimensional hyperellipsoid. The paper identifies the subspace orthogonal to the LayerNorm image through the left nullspace of $\text{diag}(g) - g 1^T / N$, and characterizes the principal axes as the eigenvectors of $\Pi_2 G^{-2} \Pi_2$ with semi-axes $\sqrt{N/\lambda}$, where $\lambda$ is the corresponding eigenvalue and $\alpha_n = 1/g_n$; it also discusses cases with zero gain via a Drazin inverse. These insights extend to Transformer architectures, where LayerNorm appears in several residual paths, offering a more intuitive, geometry-based understanding of its role in high-dimensional activation maps.
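As a quick numerical sanity check of the decomposition stated above, the following minimal sketch compares the usual mean-and-variance definition of LayerNorm with the projection form $\sqrt{N}\,\text{diag}(g)\,\Pi a / \sqrt{|\Pi a|^2 + N\epsilon} + b$. The function names, the random test vectors, and the choice of $N = 8$ are illustrative assumptions rather than anything taken from the paper, and the variance is assumed to be the biased ($1/N$) estimate.

```python
import math
import random


def layernorm_textbook(a, g, b, eps):
    """Usual definition: subtract the mean, divide by sqrt(var + eps), apply gain and bias."""
    n = len(a)
    mean = sum(a) / n
    var = sum((x - mean) ** 2 for x in a) / n  # biased (1/N) variance, assumed here
    return [gi * (x - mean) / math.sqrt(var + eps) + bi
            for x, gi, bi in zip(a, g, b)]


def layernorm_projection(a, g, b, eps):
    """Projection form: sqrt(N) * diag(g) * (Pi a) / sqrt(|Pi a|^2 + N * eps) + b."""
    n = len(a)
    # Pi = I - 1hat 1hat^T with 1hat = ones / sqrt(N), so Pi a removes the mean of a.
    overlap = sum(a) / math.sqrt(n)  # 1hat . a
    pi_a = [x - overlap / math.sqrt(n) for x in a]
    norm_sq = sum(x * x for x in pi_a)  # |Pi a|^2
    scale = math.sqrt(n) / math.sqrt(norm_sq + n * eps)
    return [gi * scale * x + bi for x, gi, bi in zip(pi_a, g, b)]


# Hypothetical test data, chosen only for illustration.
random.seed(0)
N = 8
a = [random.gauss(0.0, 3.0) for _ in range(N)]
g = [random.uniform(0.5, 2.0) for _ in range(N)]
b = [random.gauss(0.0, 1.0) for _ in range(N)]
eps = 1e-5

y_textbook = layernorm_textbook(a, g, b, eps)
y_projection = layernorm_projection(a, g, b, eps)
max_diff = max(abs(u - v) for u, v in zip(y_textbook, y_projection))
print("largest elementwise difference:", max_diff)
```

The two forms agree because $\sigma^2 = |\Pi a|^2 / N$, so the factor $(\sigma^2+\epsilon)^{-1/2}$ equals $\sqrt{N} / \sqrt{|\Pi a|^2 + N\epsilon}$.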
Abstract
This technical note aims to offer deeper intuition for the LayerNorm function common in deep neural networks. LayerNorm is defined relative to a distinguished 'neural' basis, but it does more than just normalize the corresponding vector elements. Rather, it implements a composition -- of linear projection, nonlinear scaling, and then affine transformation -- on input activation vectors. We develop both a new mathematical expression and geometric intuition to make the net effect more transparent. We emphasize that, when LayerNorm acts on an N-dimensional vector space, all outcomes of LayerNorm lie within the intersection of an (N-1)-dimensional hyperplane and the interior of an N-dimensional hyperellipsoid. This intersection is the interior of an (N-1)-dimensional hyperellipsoid, and typical inputs are mapped near its surface. We find the direction and length of the principal axes of this (N-1)-dimensional hyperellipsoid via the eigen-decomposition of a simply constructed matrix.
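The geometric claim can be probed numerically. The sketch below assumes all gains are nonzero and uses two facts that follow from the LayerNorm definition: with $u = G^{-1}(y - b)$, the output $y$ satisfies $\sum_n u_n = 0$ (the hyperplane condition) and $|u|^2 / N = \sigma^2 / (\sigma^2 + \epsilon) < 1$ (strictly inside the hyperellipsoid, approaching its surface when $\sigma^2 \gg \epsilon$). The helper names, the Gaussian inputs, and the sizes are hypothetical choices made for illustration only.

```python
import math
import random


def layernorm(a, g, b, eps):
    """Standard LayerNorm with biased (1/N) variance."""
    n = len(a)
    mean = sum(a) / n
    var = sum((x - mean) ** 2 for x in a) / n
    return [gi * (x - mean) / math.sqrt(var + eps) + bi
            for x, gi, bi in zip(a, g, b)]


def geometry_check(a, g, b, eps):
    """Return (hyperplane residual, ellipsoid level) for one input vector.

    Assumes every gain g_n is nonzero so that G^{-1} exists.
    """
    n = len(a)
    y = layernorm(a, g, b, eps)
    u = [(yi - bi) / gi for yi, gi, bi in zip(y, g, b)]  # G^{-1} (y - b)
    residual = sum(u)                  # zero when y - b lies in the hyperplane
    level = sum(x * x for x in u) / n  # below 1 inside the ellipsoid, 1 on its surface
    return residual, level


# Hypothetical setup, chosen only for illustration.
random.seed(1)
N = 16
eps = 1e-5
g = [random.uniform(0.5, 2.0) for _ in range(N)]
b = [random.gauss(0.0, 1.0) for _ in range(N)]

max_abs_residual = 0.0
min_level = float("inf")
max_level = 0.0
for _ in range(1000):
    a = [random.gauss(0.0, 1.0) for _ in range(N)]
    residual, level = geometry_check(a, g, b, eps)
    max_abs_residual = max(max_abs_residual, abs(residual))
    min_level = min(min_level, level)
    max_level = max(max_level, level)

print("largest hyperplane residual:", max_abs_residual)
print("ellipsoid level range:", min_level, max_level)
```

For inputs whose variance is large compared with $\epsilon$, the level sits just below 1, which is the sense in which typical inputs land near the surface of the hyperellipsoid.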
