
The Geometric Cost of Normalization: Affine Bounds on the Bayesian Complexity of Neural Networks

Sungbae Chun

Abstract

LayerNorm and RMSNorm impose fundamentally different geometric constraints on their outputs - and this difference has a precise, quantifiable consequence for model complexity. We prove that LayerNorm's mean-centering step, by confining data to a linear hyperplane (through the origin), reduces the Local Learning Coefficient (LLC) of the subsequent weight matrix by exactly $m/2$ (where $m$ is its output dimension); RMSNorm's projection onto a sphere preserves the LLC entirely. This reduction is structurally guaranteed before any training begins, determined by data manifold geometry alone. The underlying condition is a geometric threshold: for the codimension-one manifolds we study, the LLC drop is binary -- any non-zero curvature, regardless of sign or magnitude, is sufficient to preserve the LLC, while only affinely flat manifolds cause the drop. At finite sample sizes this threshold acquires a smooth crossover whose width depends on how much of the data distribution actually experiences the curvature, not merely on whether curvature exists somewhere. We verify both predictions experimentally with controlled single-layer scaling experiments using the wrLLC framework. We further show that Softmax simplex data introduces a "smuggled bias" that activates the same $m/2$ LLC drop when paired with an explicit downstream bias, proved via the affine symmetry extension of the main theorem and confirmed empirically.
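The mechanism behind the $m/2$ figure can be sketched in one line from the statements above (this is a reconstruction of the argument, not the paper's proof of Theorem 1). LayerNorm's mean-centering confines its output to the hyperplane $\{y : \mathbf{1}^\top y = 0\}$, so for any $u \in \mathbb{R}^m$

$$(W + u\mathbf{1}^\top)\,y = Wy \quad \text{whenever } \mathbf{1}^\top y = 0,$$

i.e. the loss is exactly constant along an $m$-dimensional family of directions in $W$. In an otherwise regular model, where the LLC counts half of the effective parameter dimensions, removing $m$ effective directions lowers the LLC by $m/2$. The RMSNorm sphere $\{y : \|y\|_2 = \sqrt{d}\}$ is curved and admits no such affine symmetry of $W$, so no directions are lost and the LLC is preserved.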


Paper Structure

This paper contains 26 sections, 7 theorems, 13 equations, 2 figures, 3 tables.

Key Result

Proposition 1

Let $f(x) = Wx$ with $W \in \mathbb{R}^{m \times d}$, and let the input lie on the standard simplex $\Sigma^{d-1}$, so $\mathbf{1}^\top x = 1$. Decompose $x = x' + \frac{1}{d}\mathbf{1}$ where $\mathbf{1}^\top x' = 0$; then $f(x) = Wx' + b_\textup{smuggled}$ where $b_\textup{smuggled} = \frac{1}{d}W\mathbf{1}$.
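To make the smuggled bias and its interaction with an explicit bias concrete, the short computation behind the proposition runs as follows (a reconstruction from the statement above; the symmetry step is the affine extension the abstract refers to):

$$Wx = W\!\left(x' + \tfrac{1}{d}\mathbf{1}\right) = Wx' + \tfrac{1}{d}W\mathbf{1} = Wx' + b_\textup{smuggled},$$

and once an explicit downstream bias $b$ is present, for any $u \in \mathbb{R}^m$,

$$(W + u\mathbf{1}^\top)x + (b - u) = Wx + u\,(\mathbf{1}^\top x) + b - u = Wx + b,$$

since $\mathbf{1}^\top x = 1$ on the simplex. The pair $(W, b)$ therefore carries an exact $m$-dimensional symmetry, the same degeneracy that produces the $m/2$ LLC drop in the LayerNorm case.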

Figures (2)

  • Figure 1: Curvature geometry and effective LLC. Left: Median LLC per manifold class ($d=5$, $m=4$, 5 seeds). All curved surfaces match the Gaussian baseline ($\approx 12.0$); only the flat hyperplane drops to $\approx 9.9 \approx 12.0 - m/2$. Curvature sign is irrelevant: the threshold is between flat and non-flat. Right: LLC vs. bump amplitude $A$ for wide ($\alpha=0.1$) and narrow ($\alpha=10.0$) bumps. The wide bump recovers near $A^* \approx 0.1$; the narrow bump stays near the flat bound until $A \geq 1.0$, showing that the effective RLCT depends on data--curvature overlap, not just the existence of curvature.
  • Figure 2: LLC reduction under LayerNorm and RMSNorm. Left: Varying output dimension $m$ with fixed $d=12$. $\Delta\lambda$ is consistent with the $m/2$ prediction for LayerNorm (with increasing variance at large $m$); RMSNorm shows $\Delta\lambda \approx 0$ throughout. Right: Varying input dimension $d$ with fixed $m=4$. $\Delta\lambda \approx 2.0$ regardless of $d$, confirming that the drop depends only on the number of dimensions lost to the constraint. A small numerical sketch of the underlying constraint difference follows this list.
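As a minimal illustration of the constraint difference that Figure 2 probes, the NumPy sketch below (not part of the paper's wrLLC experiments; dimensions and seeds are arbitrary) checks the two output constraints and the flat directions they do or do not create for a downstream weight matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 12, 4
X = rng.normal(size=(1000, d))

# LayerNorm without affine parameters: subtract the per-row mean, divide by
# the per-row standard deviation.
ln_out = (X - X.mean(axis=1, keepdims=True)) / X.std(axis=1, keepdims=True)

# RMSNorm: divide by the per-row root-mean-square, no centering.
rms_out = X / np.sqrt((X ** 2).mean(axis=1, keepdims=True))

# LayerNorm outputs obey two constraints: zero sum (a hyperplane through the
# origin) and norm sqrt(d). RMSNorm outputs obey only the norm constraint.
print(np.abs(ln_out.sum(axis=1)).max())                            # ~0: hyperplane
print(np.abs(np.linalg.norm(ln_out, axis=1) - np.sqrt(d)).max())   # ~0: sphere
print(np.abs(rms_out.sum(axis=1)).max())                           # not ~0: no hyperplane
print(np.abs(np.linalg.norm(rms_out, axis=1) - np.sqrt(d)).max())  # ~0: sphere

# The hyperplane creates exact flat directions in a downstream weight matrix:
# adding u @ 1^T to W does not change W @ y for any y with 1^T y = 0.
W = rng.normal(size=(m, d))
u = rng.normal(size=(m, 1))
W_shifted = W + u @ np.ones((1, d))
print(np.abs((W_shifted - W) @ ln_out.T).max())   # ~0: invisible on LayerNorm data
print(np.abs((W_shifted - W) @ rms_out.T).max())  # not ~0: visible on RMSNorm data
```

The last two prints make the geometric point directly: the rank-one update $u\mathbf{1}^\top$ is invisible on mean-centered (LayerNorm-style) data but not on RMSNorm-style data, which is the $m$-dimensional symmetry behind the $m/2$ reduction seen in the left panel.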

Theorems & Definitions (18)

  • Remark 1: Lagrangian Mechanics Analogy
  • Remark 2: Semi-algebraic nature of the Softmax constraint
  • Proposition 1: Simplex-Bias Duality
  • Conjecture 1: Post-LayerNorm Smuggled-Bias Degeneracy
  • Theorem 1: Symmetry-Induced RLCT Reduction
  • Corollary 1: LLC Drop for Normalization Layers
  • Remark 3: Extension to deep networks
  • Corollary 2: Layer-wise Additivity under wrLLC (conditional)
  • Remark 4
  • Remark 5: wrLLC as a lower bound on degeneracy
  • ...and 8 more