Learning in Compact Spaces with Approximately Normalized Transformer

Jörg K. H. Franke, Urs Spiegelhalter, Marianna Nezhurina, Jenia Jitsev, Frank Hutter, Michael Hefenbrock

TL;DR

The paper tackles training instability and hyperparameter sensitivity in Transformer-based language models by proposing an approximately normalized Transformer (anTransformer) that uses scalar normalization factors derived from concentration of measure to keep representations compact without explicit normalization layers. It introduces anGPT, an approximately normalized GPT, which replaces exact normalization with ν-based factors, applies logit scaling, and uses a reparameterization scheme to achieve uniform optimization dynamics while bounding parameter norms to remove weight decay. Empirical results show up to 40% faster convergence with only about 3% additional runtime per training step, and consistent improvements across model scales and downstream tasks, with scaling laws closely matching those of GPT while offering practical efficiency gains. The approach decouples stabilization from loss minimization, reducing hyperparameter tuning, enabling predictable scaling, and suggesting avenues for efficient low-precision training and broader applicability. Overall, the method demonstrates that approximate, concentration-driven normalization can deliver convergence benefits and robustness with minimal overhead, opening directions for large-scale and hardware-efficient language model training.
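The concentration idea behind the scalar factors can be illustrated with a small experiment. The sketch below is not the paper's derivation of its ν factors. It assumes the simplest case, a vector with i.i.d. standard Gaussian entries, whose norm concentrates around $\sqrt{d}$, so a fixed factor of $1/\sqrt{d}$ approximately normalizes it. The names `norm_spread`, `approx_normalize`, and `nu` are hypothetical.

```python
import math
import random


def norm_spread(d, n_samples=200, seed=0):
    """Return (mean, std) of ||x||_2 / sqrt(d) for x with i.i.d. N(0, 1) entries."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(n_samples):
        squared_norm = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(d))
        ratios.append(math.sqrt(squared_norm / d))
    mean = sum(ratios) / n_samples
    std = math.sqrt(sum((r - mean) ** 2 for r in ratios) / n_samples)
    return mean, std


def approx_normalize(x):
    """Scale by the fixed factor 1/sqrt(d) instead of the exact 1/||x||_2.

    Assumption: entries of x have roughly unit variance, so ||x||_2 is
    close to sqrt(d) when d is large.
    """
    nu = 1.0 / math.sqrt(len(x))  # hypothetical stand-in for a normalization factor
    return [nu * value for value in x]


if __name__ == "__main__":
    for d in (16, 256, 4096):
        mean, std = norm_spread(d)
        print(f"d={d:5d}  mean={mean:.4f}  std={std:.4f}")
```

The spread of the ratio shrinks as $d$ grows, which is why a single scalar multiplication can stand in for an exact normalization in wide layers under this assumption.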

Abstract

The successful training of deep neural networks requires addressing challenges such as overfitting, numerical instabilities leading to divergence, and increasing variance in the residual stream. A common solution is to apply regularization and normalization techniques that usually require tuning additional hyperparameters. An alternative is to force all parameters and representations to lie on a hypersphere. This removes the need for regularization and increases convergence speed, but comes with additional costs. In this work, we propose a more holistic, approximate normalization via simple scalar multiplications motivated by the tight concentration of the norms of high-dimensional random vectors. Additionally, instead of applying strict normalization for the parameters, we constrain their norms. These modifications remove the need for weight decay and learning rate warm-up as well, but do not increase the total number of normalization layers. Our experiments with transformer architectures show up to 40% faster convergence compared to GPT models with QK normalization, with only 3% additional runtime cost. When deriving scaling laws, we found that our method enables training with larger batch sizes while preserving the favorable scaling characteristics of classic GPT architectures.
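Constraining parameter norms, as opposed to strictly normalizing them, can be sketched as a projection applied after each optimizer step. The example below is a minimal illustration under assumptions the abstract does not specify: it bounds each row of a weight matrix to an L2 norm of at most 1 and leaves rows already inside the bound untouched. The function name `bound_row_norms`, the row-wise grouping, and the bound `max_norm=1.0` are hypothetical choices.

```python
import math


def bound_row_norms(weight, max_norm=1.0):
    """Project each row onto the L2 ball of radius max_norm.

    Rows with norm <= max_norm are returned unchanged; strict normalization
    would instead rescale every row to have norm exactly max_norm.
    """
    bounded = []
    for row in weight:
        norm = math.sqrt(sum(w * w for w in row))
        scale = min(1.0, max_norm / norm) if norm > 0.0 else 1.0
        bounded.append([scale * w for w in row])
    return bounded


if __name__ == "__main__":
    weight = [[3.0, 4.0], [0.3, 0.4], [0.0, 0.0]]
    for row in bound_row_norms(weight):
        print(row)
```

Because the norms can never exceed the bound, there is no growth for a weight-decay term to counteract, which matches the motivation given in the abstract.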

Paper Structure

This paper contains 43 sections, 1 theorem, 13 equations, 16 figures, 10 tables.

Key Result

Theorem 1

(Vershynin, 2018) Let $\bm{x} \sim \mathcal{U}(S^{d - 1})$ be a random vector uniformly distributed on the Euclidean unit sphere $S^{d - 1} = \{\bm{x} \in \mathbb{R}^d : \Vert \bm{x} \Vert_2 = 1\}$ and let $f : S^{d - 1} \to \mathbb{R}$ be a Lipschitz function. Then, for every $t \ge 0$,

$$\mathbb{P}\left( \left\vert f(\bm{x}) - \mathbb{E}[f(\bm{x})] \right\vert \ge t \right) \le 2 \exp\left( - \frac{c \, d \, t^2}{\Vert f \Vert_{\text{Lip}}^2} \right),$$

where $c > 0$ is an absolute constant and $\Vert f \Vert_{\text{Lip}}$ denotes the Lipschitz norm (smallest Lipschitz constant) of $f$.
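Theorem 1 says that a Lipschitz function of a uniform point on the sphere stays close to its mean, and increasingly so as $d$ grows. A quick Monte Carlo check can make this concrete. The sketch below assumes the 1-Lipschitz test function $f(\bm{x}) = x_1$, whose mean on the sphere is 0 by symmetry, and estimates the probability of a deviation of at least $t$ for several dimensions. The function names, the choice of $f$, and the sample sizes are illustrative.

```python
import math
import random


def sample_sphere(d, rng):
    """Draw x uniformly from S^{d-1} by normalizing a standard Gaussian vector."""
    g = [rng.gauss(0.0, 1.0) for _ in range(d)]
    norm = math.sqrt(sum(v * v for v in g))
    return [v / norm for v in g]


def tail_probability(d, t, n_samples=2000, seed=0):
    """Estimate P(|f(x) - E f(x)| >= t) for f(x) = x_1, where E f(x) = 0."""
    rng = random.Random(seed)
    hits = sum(abs(sample_sphere(d, rng)[0]) >= t for _ in range(n_samples))
    return hits / n_samples


if __name__ == "__main__":
    for d in (16, 64, 256):
        print(f"d={d:4d}  P(|x_1| >= 0.2) ~ {tail_probability(d, t=0.2):.4f}")
```

The estimated tail probability drops quickly with $d$ at fixed $t$, which is the qualitative behavior the theorem describes.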

Figures (16)

  • Figure 1: The input norm (log scale) of each layer over the course of training a $0.5B$ model on $10B$ tokens. Deeper layers obtain a higher input norm in the classical GPT. While nGPT completely eliminates this "Curse of Depth", anGPT effectively mitigates it.
  • Figure 2: Scaling trend fits for optimal batch size and learning rate as functions of model size $N$. Grid point markers are shaded by excess loss relative to all configurations for this parameter. Diamond markers show the two-stage interpolation-based estimates of optimal hyperparameters. Dashed lines represent fitted power laws using the estimated optimal hyperparameters.
  • Figure 3: Training the $0.5B$ model up to $7\times$ Chinchilla optimal token budget. Each point is the final validation loss of a full training run with the training budget noted on the abscissa. Below, we measure the convergence speed-up against the GPT+ model with QK normalization.
  • Figure 4: Training different model sizes on different token budgets. Each point represents a full training run with the training budget noted on the abscissa. The scaling law is fitted for both architectures as described in the appendix on scaling laws and indicated by the dashed line.
  • Figure 5: We run ablation experiments with a $0.5B$ parameter model using $10B$ tokens from OpenWebText. Adding QK norm shows a performance gain. We modify nGPT by replacing the normalization of the LERP update with a normalization factor and, in addition, by bounding weights instead of normalizing them. The anGPT mainly replaces scaling vectors by normalization factors.
  • ...and 11 more figures

Theorems & Definitions (1)

  • Theorem 1: Concentration of Lipschitz functions on the sphere