Scaling Exponents Across Parameterizations and Optimizers

Katie Everett; Lechao Xiao; Mitchell Wortsman; Alexander A. Alemi; Roman Novak; Peter J. Liu; Izzeddin Gur; Jascha Sohl-Dickstein; Leslie Pack Kaelbling; Jaehoon Lee; Jeffrey Pennington

TL;DR

This work broadens the theory and practice of width scaling by introducing an alignment-general space of parameterizations that accounts for three alignment contributions and a feature-learning residual. It demonstrates, through extensive experiments across multiple optimizers and architectures, that all parameterizations can achieve hyperparameter transfer when paired with principled per-layer learning-rate exponents and layer-wise constants, challenging the prior emphasis on muP. A key practical insight is the critical role of the epsilon term in adaptive optimizers, with a proposed Adam-atan2 variant offering a scale-invariant, epsilon-free alternative. Collectively, the results provide a more flexible and robust framework for scaling transformers, including recommendations for standard parameterization with per-layer LR and the need to consider compute-horizon effects in future theory and practice.

Abstract

Robust and effective scaling of models from small to large width typically requires the precise adjustment of many algorithmic and architectural details, such as parameterization and optimizer choices. In this work, we propose a new perspective on parameterization by investigating a key assumption in prior work about the alignment between parameters and data and derive new theoretical results under weaker assumptions and a broader set of optimizers. Our extensive empirical investigation includes tens of thousands of models trained with all combinations of three optimizers, four parameterizations, several alignment assumptions, more than a dozen learning rates, and fourteen model sizes up to 26.8B parameters. We find that the best learning rate scaling prescription would often have been excluded by the assumptions in prior work. Our results show that all parameterizations, not just maximal update parameterization (muP), can achieve hyperparameter transfer; moreover, our novel per-layer learning rate prescription for standard parameterization outperforms muP. Finally, we demonstrate that an overlooked aspect of parameterization, the epsilon parameter in Adam, must be scaled correctly to avoid gradient underflow and propose Adam-atan2, a new numerically stable, scale-invariant version of Adam that eliminates the epsilon hyperparameter entirely.
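The abstract describes Adam-atan2 only in words. As a rough illustration, the scalar sketch below replaces Adam's usual division `m_hat / (sqrt(v_hat) + eps)` with `atan2(m_hat, sqrt(v_hat))`, which needs no epsilon and is unchanged when the gradient is rescaled by a positive constant. The function name `adam_atan2_step`, its argument names, and the default hyperparameters are assumptions made for this example; the paper's exact formulation (for instance, any extra scaling constants) may differ.

```python
import math


def adam_atan2_step(param, grad, m, v, step, lr=1e-3, beta1=0.9, beta2=0.999):
    """One scalar update of an Adam-atan2-style optimizer (illustrative sketch).

    `step` counts from 1. Returns the new parameter and the updated moments.
    """
    # Standard Adam exponential moving averages of the gradient and its square.
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad

    # Standard Adam bias correction.
    m_hat = m / (1.0 - beta1 ** step)
    v_hat = v / (1.0 - beta2 ** step)

    # atan2 stands in for m_hat / (sqrt(v_hat) + eps): it is well defined even
    # when both arguments are zero, so no epsilon is required.
    update = math.atan2(m_hat, math.sqrt(v_hat))
    return param - lr * update, m, v


# Rescaling the gradient by a tiny positive constant leaves the update unchanged.
p_large, _, _ = adam_atan2_step(0.0, 1.0, 0.0, 0.0, step=1)
p_small, _, _ = adam_atan2_step(0.0, 1e-20, 0.0, 0.0, step=1)
print(p_large, p_small)
```

With a fixed epsilon, the second call would be dominated by epsilon once the gradient magnitude falls far below it, which is the underflow concern the abstract raises.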

Paper Structure (57 sections, 24 equations, 31 figures, 11 tables)

Introduction
Background
Parameterizations and Optimizers
Stability, nontriviality and feature learning
Alignment
Theoretical Contributions
Model and Notation
Equivalence classes
Alignment-General Space of Parameterizations
Stability at initialization
Stability during training
Prior Work as a Special Case
Maximum Stable Learning Rate Exponents
Alignment Ratio
Experiments
...and 42 more sections

Figures (31)

Figure 1: The four parameterizations occupy two equivalence classes at initialization, which differ only in the readout layer. Each parameterization is plotted for each layer type at $(a_l, b_l)$ where $a_l$ is the negative parameter multiplier exponent and $b_l$ is the negative initialization standard deviation exponent. The black dashed lines span the equivalence classes for each layer. The region where parameterizations are stable is highlighted in gray: this is the line $a_1 + b_1 = 0$ for the embedding layer, the line $a_l + b_l = 1/2$ for hidden layers, and the region $a_{L+1} + b_{L+1} \geq 1/2$ for the readout layer. For equivalence during training, the learning rates must also obey the optimizer-specific equivalence relations.
Figure 2: Alignment is intermediate and highly dynamic throughout training, with parameterization-specific patterns. The log alignment ratio metric in readout and hidden (MLP) layers across training steps for each parameterization, for Adam $1.9B$ parameter models ($H=32, D=4096, B=256$) using optimal global learning rates. Blue and green curves are for the first and second MLP layers, respectively, in each Transformer block. Transformer blocks are denoted B0 through B7 in the legend. Orange curves are the readout layer.
Figure 3: All parameterizations for Adam benefit from per-layer learning rates and tuning per-layer constant learning rate multipliers. Eval loss comparisons for all parameterizations using Adam across a sequence of interventions. From left to right panels: (a) global lr exponents + default constants, (b) per-layer lr exponents assuming full alignment + default constants, (c) per-layer lr exponents assuming full alignment + optimal constants, (d) per-layer lr exponents assuming no alignment + optimal constants.
Figure 4: All parameterizations can perform hyperparameter transfer with the right per-layer learning rate exponents. Top row = learning rate sweep for all parameterizations using Adam with per-layer learning rates assuming full alignment and optimal constants. The LR scaling is fully encapsulated by the per-layer LR exponents so the base learning rate is consistent across model widths. Bottom row = power law fit of optimal LR vs model dim, with exponents close to zero indicating the same base LR can be reused at all model widths. Only mean-field parameterization deviates slightly from zero, which is improved by addressing epsilon underflow in \ref{sec:results_epsilon}.
Figure 5: For SGD + momentum (top row) and Adam + parameter scaling (bottom row), eval loss comparisons for all parameterizations across a sequence of interventions. From left to right columns: (a) global LR exponents + default constants, (b) per-layer LR exponents assuming full alignment + default constants, (c) per-layer LR exponents assuming full alignment + optimal constants, (d) per-layer LR exponents assuming no alignment + optimal constants.
...and 26 more figures
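The stability conditions stated in the Figure 1 caption can be written as a small check. This is a minimal sketch: the function name `is_stable_at_init`, the layer labels, and the example exponent pairs are hypothetical choices for illustration and are not taken from the paper's tables.

```python
def is_stable_at_init(layer, a, b, tol=1e-12):
    """Check the Figure 1 stability condition for one layer type.

    a: negative parameter multiplier exponent
    b: negative initialization standard deviation exponent
    """
    total = a + b
    if layer == "embedding":
        # Stable on the line a_1 + b_1 = 0.
        return abs(total) <= tol
    if layer == "hidden":
        # Stable on the line a_l + b_l = 1/2.
        return abs(total - 0.5) <= tol
    if layer == "readout":
        # Stable in the region a_{L+1} + b_{L+1} >= 1/2.
        return total >= 0.5 - tol
    raise ValueError(f"unknown layer type: {layer!r}")


# Illustrative (a, b) pairs only; these are not a named parameterization.
examples = [
    ("embedding", 0.0, 0.0),
    ("hidden", 0.0, 0.5),
    ("hidden", 0.0, 0.0),
    ("readout", 0.5, 0.5),
]
for layer, a, b in examples:
    print(layer, (a, b), is_stable_at_init(layer, a, b))
```

The check covers initialization only. As the caption notes, equivalence during training additionally depends on optimizer-specific learning rate relations, which this sketch does not model.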

Theorems & Definitions (6)

Definition B.1
Definition B.2
Definition B.3
Definition B.4
Definition B.5
Definition B.6
