nGPT: Normalized Transformer with Representation Learning on the Hypersphere
Ilya Loshchilov, Cheng-Ping Hsieh, Simeng Sun, Boris Ginsburg
TL;DR
nGPT is a normalized Transformer that constrains all embeddings and hidden states to unit norm, so they lie on a hypersphere. Dot products then behave like cosine similarities, and each layer update can be interpreted as a retraction step on a Riemannian manifold. The model uses per-dimension eigen learning rates to control the attention and MLP updates, which decouples the contributions of the two blocks and effectively turns the architecture into a variable-metric optimizer. Empirically, nGPT converges 4× to 20× faster than GPT on OpenWebText across 1k–8k context lengths and 0.5B–1B parameter scales, with better-conditioned attention matrices and robust length extrapolation. Ablations show that many of the normalization and scaling components can be simplified without large performance losses, which supports the practical viability of hypersphere-based representation learning for Transformers.
Abstract
We propose a novel neural network architecture, the normalized Transformer (nGPT), with representation learning on the hypersphere. In nGPT, all vectors forming the embeddings, MLP, attention matrices and hidden states are normalized to unit norm. The input stream of tokens travels on the surface of a hypersphere, with each layer contributing a displacement towards the target output predictions. These displacements are defined by the MLP and attention blocks, whose vector components also reside on the same hypersphere. Experiments show that nGPT learns much faster, reducing the number of training steps required to achieve the same accuracy by a factor of 4 to 20, depending on the sequence length.
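The displacement step described above can be illustrated with a small sketch. It assumes that a layer moves the hidden state toward the normalized block output using per-dimension step sizes and then renormalizes the result back onto the unit sphere. The function names (`normalize`, `hypersphere_step`, `cosine`), the interpolation form of the update, and the example values are illustrative assumptions, not details taken from the text above.

```python
import math


def normalize(v):
    """Scale a vector to unit Euclidean norm."""
    n = math.sqrt(sum(x * x for x in v))
    if n == 0.0:
        raise ValueError("cannot normalize the zero vector")
    return [x / n for x in v]


def hypersphere_step(h, block_output, alpha):
    """One hypothetical layer update on the unit sphere.

    h            : current hidden state (assumed unit norm)
    block_output : raw output of an attention or MLP block
    alpha        : per-dimension step sizes (assumed form of the
                   "eigen learning rates")
    """
    target = normalize(block_output)  # the block's suggestion, on the sphere
    # Move each coordinate part of the way toward the target.
    moved = [hi + a * (ti - hi) for hi, ti, a in zip(h, target, alpha)]
    # Retraction: project the moved point back onto the unit sphere.
    return normalize(moved)


def cosine(u, v):
    """Dot product, which equals cosine similarity for unit vectors."""
    return sum(a * b for a, b in zip(u, v))


# Toy example in three dimensions.
h = normalize([1.0, 0.0, 0.0])
block_output = [0.0, 2.0, 0.0]
alpha = [0.5, 0.5, 0.5]

h_new = hypersphere_step(h, block_output, alpha)
target = normalize(block_output)
print(cosine(h, target), cosine(h_new, target))
```

In this toy case the updated state stays on the unit sphere and its cosine similarity to the block's suggestion increases, which is the sense in which each layer contributes a displacement along the sphere.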
