nGPT: Normalized Transformer with Representation Learning on the Hypersphere

Ilya Loshchilov, Cheng-Ping Hsieh, Simeng Sun, Boris Ginsburg

TL;DR

nGPT is a normalized Transformer that constrains all embeddings and hidden states to unit norm, so they lie on a hypersphere. Dot products between them then behave as cosine similarities, and each layer update can be read as a retraction step in Riemannian optimization. Per-dimension "eigen learning rates" scale the attention and MLP updates, which decouples the contributions of the two blocks and lets the architecture act as a variable-metric optimizer. On OpenWebText, nGPT converges 4× to 20× faster than GPT for context lengths of 1k to 8k and model sizes of 0.5B and 1B parameters; its attention matrices are better conditioned, and it extrapolates robustly to longer sequences. Ablations show that many of the normalization and scaling components can be simplified without a large loss in performance, which supports the practical viability of hypersphere-based representation learning for Transformers.

Abstract

We propose a novel neural network architecture, the normalized Transformer (nGPT) with representation learning on the hypersphere. In nGPT, all vectors forming the embeddings, MLP, attention matrices and hidden states are unit norm normalized. The input stream of tokens travels on the surface of a hypersphere, with each layer contributing a displacement towards the target output predictions. These displacements are defined by the MLP and attention blocks, whose vector components also reside on the same hypersphere. Experiments show that nGPT learns much faster, reducing the number of training steps required to achieve the same accuracy by a factor of 4 to 20, depending on the sequence length.
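
The idea of a layer "contributing a displacement" on the hypersphere can be illustrated with a minimal NumPy sketch. It assumes an update of the form h ← Norm(h + α · (Norm(block output) − h)), which is one way to read the description above; the function names `unit_norm` and `hypersphere_step`, the random stand-in for the block output, and the step size 0.05 are all hypothetical choices for illustration, not the authors' implementation.

```python
import numpy as np

def unit_norm(x, axis=-1, eps=1e-12):
    """Project vectors onto the unit hypersphere along `axis`."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def hypersphere_step(h, block_out, alpha):
    """Move h toward the normalized block output, then renormalize.

    h         : (n_tokens, d) unit-norm hidden states
    block_out : (n_tokens, d) raw block output (a random stand-in below)
    alpha     : (d,) per-dimension step sizes (the "eigen learning rates")
    """
    target = unit_norm(block_out)          # the block's suggested point on the sphere
    return unit_norm(h + alpha * (target - h))

rng = np.random.default_rng(0)
d = 8
h = unit_norm(rng.normal(size=(4, d)))     # hidden states start on the sphere
block_out = rng.normal(size=(4, d))        # assumption: placeholder for attention/MLP output
alpha = np.full(d, 0.05)                   # assumption: illustrative step size

h_next = hypersphere_step(h, block_out, alpha)
norms = np.linalg.norm(h_next, axis=-1)    # each row stays on the unit sphere

# Cosine similarity to the target before and after the step.
target = unit_norm(block_out)
cos_before = np.sum(h * target, axis=-1)
cos_after = np.sum(h_next * target, axis=-1)
```

Because every vector has unit norm, the row-wise dot products `cos_before` and `cos_after` are cosine similarities, and with a uniform step size between 0 and 1 the step moves each hidden state along the sphere toward the block's suggestion.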

Paper Structure

This paper contains 31 sections, 24 equations, 15 figures, and 6 tables.

Figures (15)

  • Figure 1: Validation loss during training of 1B GPT and nGPT with 4k context length.
  • Figure 2: Final validation loss (y-axis) for training runs with different computation budgets in tokens (x-axis). The training of 0.5B and 1B nGPT models is about 4x, 10x and 20x faster (in terms of tokens) on 1k, 4k and 8k context lengths, respectively.
  • Figure 3: Models trained with 4k context length. Final performance (y-axis) on a set of downstream tasks and their average value (Bottom-Right) for different computation budgets in tokens (x-axis).
  • Figure 4: Left: Distribution of norms of vectors from input (Top line) and output (Bottom line) embedding matrices. Middle: Distribution of eigenvalues divided by their median value. Right: Pairwise distribution of dot products between embeddings. Models are trained for 100k iterations.
  • Figure 5: Median condition numbers for attention and MLP matrices at different layer depths (24 and 36 layers for 0.5B and 1B models, respectively). Models are trained for 100k iterations.
  • ...and 10 more figures