Geometric Interpretation of Layer Normalization and a Comparative Analysis with RMSNorm

Akshat Gupta; Atahan Ozdemir; Gopala Anumanchipalli

TL;DR

This work introduces a geometric interpretation of LayerNorm by showing that standardization removes the projection of a hidden vector along the uniform vector and then scales the remaining orthogonal component to a fixed norm, which also reveals why the operation is irreversible. It provides empirical evidence from decoder‑only LLMs that hidden representations are, on average, already orthogonal to the uniform vector at inference, making the mean subtraction in LayerNorm redundant and suggesting RMSNorm as a more efficient alternative. Norm stabilization and rotation effects are quantified across models, showing that RMSNorm achieves similar orientation and stability without the extra mean subtraction. The findings advocate adopting RMSNorm in large language models for efficiency while preserving performance, and they offer a principled basis for rethinking normalization in Transformers.
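The irreversibility and redundancy claims in this summary can be checked numerically with a minimal NumPy sketch. It is an illustration under stated assumptions, not the authors' code: the function names are hypothetical, the learnable gain and bias and the numerical-stability epsilon are omitted, and the uniform vector is taken to be the all-ones vector. The redundancy check uses an exactly mean-zero vector, which idealizes the paper's empirical finding that hidden vectors are orthogonal to the uniform vector on average.

```python
import numpy as np

def layernorm_standardize(x):
    # Standardization step only: no epsilon, no learnable gain/bias (assumption).
    return (x - x.mean()) / x.std()

def rmsnorm_standardize(x):
    # RMSNorm without the learnable gain (assumption): divide by the root mean square.
    return x / np.sqrt(np.mean(x ** 2))

rng = np.random.default_rng(1)
d = 8
x = rng.normal(size=d)
ones = np.ones(d)  # the uniform vector

# Irreversibility: a positive rescaling plus any shift along the uniform
# vector yields a different input with the same standardized output.
y = 3.0 * x + 5.0 * ones
same_output = np.allclose(layernorm_standardize(x), layernorm_standardize(y))

# Redundancy: for a vector that is already orthogonal to the uniform vector
# (zero mean), mean subtraction does nothing and the two outputs coincide.
x_perp = x - x.mean()
agree = np.allclose(layernorm_standardize(x_perp), rmsnorm_standardize(x_perp))

print(same_output, agree)
```

Because `x` and `y` differ while their standardized outputs match, the input cannot be recovered from the output alone, which is the sense of irreversibility used here.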

Abstract

This paper presents a novel geometric interpretation of LayerNorm and explores how LayerNorm influences the norm and orientation of hidden vectors in the representation space. These geometric insights lay the foundation for comparing LayerNorm with RMSNorm. We show that the definition of LayerNorm is innately linked to the uniform vector, defined as $\boldsymbol{1} = [1, 1, 1, 1, \cdots, 1]^T \in \mathbb{R}^d$. We then show that the standardization step in LayerNorm can be understood in three simple steps: (i) remove the component of a vector along the uniform vector, (ii) normalize the remaining vector, and (iii) scale the resultant vector by $\sqrt{d}$, where $d$ is the dimensionality of the representation space. We also provide additional insights into how LayerNorm operates at inference time. Finally, we compare the hidden representations of LayerNorm-based LLMs with those of models trained using RMSNorm and show that all LLMs naturally operate orthogonal to the uniform vector at inference time; that is, on average their hidden vectors have no component along the uniform vector. This presents the first mechanistic evidence that removing the component along the uniform vector in LayerNorm is a redundant step. These results advocate for using RMSNorm over LayerNorm, since RMSNorm is also more computationally efficient.
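The three-step description above can be compared against the usual mean-and-variance form of standardization with a short NumPy sketch. This is a minimal illustration, not the paper's implementation: the function names are hypothetical, the epsilon term and the learnable gain and bias are left out, and the standard deviation is the population one (NumPy's default `ddof=0`).

```python
import numpy as np

def standardize(x):
    # Conventional form: subtract the mean, divide by the population std.
    return (x - x.mean()) / x.std()

def standardize_geometric(x):
    d = x.shape[0]
    ones = np.ones(d)  # the uniform vector
    # (i) remove the component of x along the uniform vector
    x_perp = x - (x @ ones) / (ones @ ones) * ones
    # (ii) normalize the remaining vector to unit length
    unit = x_perp / np.linalg.norm(x_perp)
    # (iii) scale the result by sqrt(d)
    return np.sqrt(d) * unit

rng = np.random.default_rng(0)
x = rng.normal(size=16)
a = standardize(x)
b = standardize_geometric(x)

print(np.allclose(a, b))   # the two forms agree
print(np.linalg.norm(a))   # norm equals sqrt(d)
```

The agreement follows because the projection of $x$ onto $\boldsymbol{1}$ is the mean times $\boldsymbol{1}$, and the population standard deviation equals the norm of the orthogonal remainder divided by $\sqrt{d}$.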

Paper Structure (16 sections, 12 equations, 13 figures, 4 tables, 1 algorithm)

Introduction
Re-Introducing Layer Normalization
The Uniform Vector and the Mean Vector
Explanation of Layer Normalization
The Irreversibility of Layer Normalization
Experiments
Methods and Models
Norm Stabilization
Rotation
LayerNorm versus RMSNorm
Conclusion
Limitations
Appendix
Computations in Decoder-only LLMs
Hyperparameters and Computation Resources
...and 1 more sections

Figures (13)

Figure 1: A diagrammatic explanation of LayerNorm and RMSNorm.
Figure 2: Visualization of the LayerNorm operation on a random original vector.
Figure 3: Rotation angle (in degrees) between the hidden vectors and Post-LN1 vectors across all layers for GPT-J 6B, Pythia 6.9B, and Llama-3 8B.
Figure 4: The growing norm of the residual stream (hidden vectors) at each layer (a-c) and how LayerNorm and RMSNorm regulate the growing norms (d-f) for GPT-J 6B, Pythia 6.9B, and Llama-3 8B. The dashed lines in the line plots represent one standard deviation.
Figure 5: Distribution of angles (in degrees) between hidden vectors (a-c) and post-normalization vectors (d-f) with the uniform vector for GPT-J 6B, Pythia 6.9B, and Llama-3 8B at a randomly selected layer (Layer 24). The results are independent of the choice of layer.
...and 8 more figures
