Learning in Compact Spaces with Approximately Normalized Transformer

Jörg K. H. Franke; Urs Spiegelhalter; Marianna Nezhurina; Jenia Jitsev; Frank Hutter; Michael Hefenbrock

TL;DR

The paper addresses training instability and hyperparameter sensitivity in Transformer-based language models. It proposes an approximately normalized Transformer (anTransformer), which keeps representations compact by applying scalar normalization factors derived from concentration of measure instead of explicit normalization layers. The GPT variant, anGPT, replaces exact normalization with these ν-based factors, applies logit scaling, and uses a reparameterization scheme to achieve uniform optimization dynamics; it also bounds parameter norms, which removes the need for weight decay. Empirical results show up to 40% faster convergence with only about 3% additional runtime per training step, consistent improvements across model scales and downstream tasks, and scaling laws that closely match those of GPT. Because the approach decouples stabilization from loss minimization, it reduces hyperparameter tuning and enables predictable scaling. It also suggests avenues for efficient low-precision training, broader applicability, and large-scale, hardware-efficient language model training.

Abstract

The successful training of deep neural networks requires addressing challenges such as overfitting, numerical instabilities leading to divergence, and increasing variance in the residual stream. A common solution is to apply regularization and normalization techniques that usually require tuning additional hyperparameters. An alternative is to force all parameters and representations to lie on a hypersphere. This removes the need for regularization and increases convergence speed, but comes with additional costs. In this work, we propose a more holistic, approximate normalization via simple scalar multiplications motivated by the tight concentration of the norms of high-dimensional random vectors. Additionally, instead of applying strict normalization for the parameters, we constrain their norms. These modifications remove the need for weight decay and learning rate warm-up as well, but do not increase the total number of normalization layers. Our experiments with transformer architectures show up to 40% faster convergence compared to GPT models with QK normalization, with only 3% additional runtime cost. When deriving scaling laws, we found that our method enables training with larger batch sizes while preserving the favorable scaling characteristics of classic GPT architectures.
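The concentration argument behind the scalar normalization can be illustrated with a small numerical sketch. It is not the paper's implementation: it assumes vectors with i.i.d. standard Gaussian entries, for which the Euclidean norm concentrates around sqrt(d), and the helper names `approx_normalize` and `norm_stats` are hypothetical. The paper's ν factors are derived per operation and are not reproduced here.

```python
import numpy as np


def approx_normalize(x: np.ndarray) -> np.ndarray:
    """Scale by the fixed scalar 1/sqrt(d) instead of dividing by the exact norm.

    Assumption: entries of x are roughly i.i.d. with unit variance,
    so the norm of x is close to sqrt(d) in high dimensions.
    """
    d = x.shape[-1]
    return x / np.sqrt(d)


def norm_stats(d: int, n: int = 1000, seed: int = 0) -> tuple[float, float]:
    """Mean and standard deviation of the norms of n approximately normalized vectors."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((n, d))
    norms = np.linalg.norm(approx_normalize(x), axis=-1)
    return float(norms.mean()), float(norms.std())


if __name__ == "__main__":
    # The spread of the norms shrinks as the dimension grows.
    for d in (16, 256, 4096):
        mean, std = norm_stats(d)
        print(f"d={d:5d}  mean norm={mean:.4f}  std={std:.4f}")
```

Under this Gaussian assumption the spread of the rescaled norms shrinks roughly like 1/sqrt(2d), which is why a fixed scalar can stand in for an exact normalization when the dimension is large.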
