Learning in Compact Spaces with Approximately Normalized Transformer
Jörg K. H. Franke, Urs Spiegelhalter, Marianna Nezhurina, Jenia Jitsev, Frank Hutter, Michael Hefenbrock
TL;DR
The paper addresses training instability and hyperparameter sensitivity in Transformer-based language models by proposing an approximately normalized Transformer (anTransformer), which uses scalar normalization factors derived from concentration of measure to keep representations compact without explicit normalization layers. The resulting model, anGPT, replaces exact normalization with ν-based factors, applies logit scaling, and uses a reparameterization scheme to obtain uniform optimization dynamics; it also bounds parameter norms, which removes the need for weight decay. Empirically, anGPT converges up to 40% faster than GPT models with QK normalization at only about 3% additional runtime per training step, shows consistent improvements across model scales and downstream tasks, and follows scaling laws that closely match those of GPT. Because stabilization is decoupled from loss minimization, the approach reduces hyperparameter tuning and enables predictable scaling, and it suggests avenues for efficient low-precision training and broader applicability. Overall, the method shows that approximate, concentration-driven normalization can deliver convergence gains and robustness with minimal overhead, which is relevant for large-scale and hardware-efficient language model training.
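The concentration-of-measure idea behind the scalar factors can be illustrated with a small experiment. The sketch below assumes standard Gaussian vectors and uses hypothetical helper names (`norm_stats`, `approx_normalize`); it is not the paper's derivation of the ν factors, only a demonstration that in high dimension the norm of a random vector stays close to sqrt(dim), so a fixed scalar multiplication can stand in for an exact normalization.

```python
import math
import random


def norm_stats(dim, num_samples=200, seed=0):
    """Return (mean, std) of ||x|| / sqrt(dim) for x drawn from N(0, I_dim)."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(num_samples):
        sq_norm = sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(dim))
        ratios.append(math.sqrt(sq_norm / dim))
    mean = sum(ratios) / num_samples
    var = sum((r - mean) ** 2 for r in ratios) / num_samples
    return mean, math.sqrt(var)


def approx_normalize(x):
    """Scale by the fixed factor 1/sqrt(dim) instead of dividing by the measured norm.

    Assumption for illustration: entries of x have roughly unit variance.
    """
    scale = 1.0 / math.sqrt(len(x))
    return [scale * v for v in x]


# The spread of the ratio shrinks as the dimension grows.
for dim in (16, 256, 4096):
    mean, std = norm_stats(dim)
    print(f"dim={dim:5d}  mean ratio={mean:.3f}  std={std:.3f}")
```

In this toy setting the ratio's standard deviation falls roughly like 1/sqrt(2·dim), which is why a constant factor becomes a good substitute for a per-vector norm computation as the width grows.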
Abstract
The successful training of deep neural networks requires addressing challenges such as overfitting, numerical instabilities leading to divergence, and increasing variance in the residual stream. A common solution is to apply regularization and normalization techniques that usually require tuning additional hyperparameters. An alternative is to force all parameters and representations to lie on a hypersphere. This removes the need for regularization and increases convergence speed, but comes with additional costs. In this work, we propose a more holistic, approximate normalization via simple scalar multiplications motivated by the tight concentration of the norms of high-dimensional random vectors. Additionally, instead of applying strict normalization for the parameters, we constrain their norms. These modifications remove the need for weight decay and learning rate warm-up as well, but do not increase the total number of normalization layers. Our experiments with transformer architectures show up to 40% faster convergence compared to GPT models with QK normalization, with only 3% additional runtime cost. When deriving scaling laws, we found that our method enables training with larger batch sizes while preserving the favorable scaling characteristics of classic GPT architectures.
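The difference between strictly normalizing parameters and merely constraining their norms can be made concrete with a minimal sketch. The abstract does not spell out the exact constraint, so the code below swaps in a generic technique, projection onto an L2 ball; the function names and the `max_norm` parameter are hypothetical and are not taken from the paper.

```python
import math


def l2_norm(w):
    """Euclidean norm of a flat list of weights."""
    return math.sqrt(sum(v * v for v in w))


def normalize(w, eps=1e-12):
    """Strict normalization: always rescale the vector onto the unit sphere."""
    n = l2_norm(w)
    return [v / (n + eps) for v in w]


def bound_norm(w, max_norm=1.0):
    """Norm constraint (assumed form): rescale only if the norm exceeds max_norm."""
    n = l2_norm(w)
    if n <= max_norm:
        return list(w)  # already inside the ball, left unchanged
    return [v * (max_norm / n) for v in w]


small = [0.3, 0.4]   # norm 0.5, inside the unit ball
large = [3.0, 4.0]   # norm 5.0, outside the unit ball

print(l2_norm(normalize(small)), l2_norm(bound_norm(small)))
print(l2_norm(normalize(large)), l2_norm(bound_norm(large)))
```

Strict normalization rescales both vectors to unit length, whereas the bounded variant leaves the small vector untouched and only shrinks the large one. In both cases the norm cannot grow without limit, which is the role weight decay would otherwise play.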
