Optimal Scaling Needs Optimal Norm

Oleg Filatov, Jiangtao Wang, Jan Ebert, Stefan Kesselheim

TL;DR

Using the Scion optimizer, we discover that joint optimal scaling across model and dataset sizes is governed by a single invariant: the operator norm of the output layer.

Abstract

Despite recent progress in optimal hyperparameter transfer under model and dataset scaling, no unifying explanatory principle has been established. Using the Scion optimizer, we discover that joint optimal scaling across model and dataset sizes is governed by a single invariant: the operator norm of the output layer. Across models with up to 1.3B parameters trained on up to 138B tokens, the optimal learning rate/batch size pair $(\eta^{\ast}, B^{\ast})$ consistently has the same operator norm value, a phenomenon we term norm transfer. This constant norm condition is necessary but not sufficient: for each dataset size, multiple $(\eta, B)$ pairs reach the optimal norm, but only a unique $(\eta^{\ast}, B^{\ast})$ achieves the best loss. As a sufficient condition, we provide the first measurement of $(\eta^{\ast}, B^{\ast})$ scaling with dataset size for Scion, and find that the scaling rules are consistent with those of the Adam optimizer. Tuning per-layer-group learning rates also improves model performance, with the output layer being the most sensitive and hidden layers benefiting from lower learning rates. We provide practical insights on norm-guided optimal scaling and release our Distributed Scion (Disco) implementation with logs from over two thousand runs to support research on LLM training dynamics at scale.

Paper Structure

This paper contains 30 sections, 9 equations, 14 figures, 3 tables, 4 algorithms.

Figures (14)

  • Figure 1: (a) Interplay of training loss, output layer norm $\lVert {\bm{W}}_\mathrm{out} \rVert_{\mathrm{RMS} \to \infty}$ and learning rate. Results are for the proxy model (69M parameters), batch size $B=128$ samples and horizon $D=2^{33}$ tokens. Points are colored by $\log_2(\eta)$ where $\eta$ is the learning rate. Black dashed lines mark the optimal configuration with minimum training loss. (b) Growth of the output layer norm vs. gradient steps. Each curve corresponds to a (learning rate $\eta$, batch size $B$) pair, with $B$ measured in samples; colour encodes batch size and line style encodes learning rate. See also the same plot vs. token horizons in Appendix \ref{app:norm-evolution-all}.
  • Figure 2: Training loss vs. output layer norm across batch sizes. (a) Fixed proxy model (69M parameters) while increasing token horizon from $2^{31}$ to $2^{37}$. (b) Fixed token horizon $2^{33}$ while scaling width/depth of the proxy model as indicated in the legend. Each batch size point (increasing from $32$ in $\times 2$ steps, reflected by marker size) has its learning rate optimally tuned. The optimal batch size per horizon/model configuration is indicated by the filled marker. All curves share an optimal norm at $7.0 \pm 0.2$ across horizons and $7.4 \pm 0.2$ across models (grey band).
  • Figure 3: (a) $(\eta,B)$ combinations that reach the optimal norm $\lVert {\bm{W}}_\mathrm{out} \rVert_{\mathrm{RMS}\to\infty}=2^{7.0\pm0.2}$ for a given token horizon. Colours denote batch size ($B$); the y-axis is learning rate ($\eta$). Solid and dashed lines denote free and heuristic fits (described in text). (b) Optimal learning rate per batch size across horizons. Circled markers indicate optimal $(\eta^*,B^*)$ with the lowest loss. Within a horizon, marker transparency linearly interpolates between the lowest- and highest-loss runs, with higher transparency indicating higher training loss. Error bars show systematic variation from the fitting method (Appendix \ref{app:norm-extraction}). Dashed lines are a joint linear regression with $\log_2{\eta^\ast} \sim \log_2{B} + \log_2{D}$.
  • Figure 4: (a) Parallel-coordinates view of per-layer-group learning rate tuning. Results are for the proxy model (69M parameters) and batch size $B=128$ samples, averaged across random seeds as described in Appendix \ref{app:model-cfg}. Dark gray lines are the top 10% runs (loss 4.11--4.18); light gray lines are the remainder (loss 4.19--4.76). Orange traces highlight the three best settings. The inset histogram shows the distribution of top 10% counts for each layer group. (b) Best learning rate layouts per training horizon under the constraint $\eta_{\text{input}}=\eta_{\text{output}}$. Results are for the proxy model (69M parameters) and batch size $B=512$ samples. All horizons favor a V-shaped layout with $\eta_{\text{hidden}}$ smaller than the input/output learning rates by the same $\times 1/8$ factor. In the legend we also report loss for the optimal $\eta_{\text{input}} = \eta_{\text{hidden}} = \eta_{\text{output}} \equiv \eta$ layout ("equal @$\eta$").
  • Figure 5: Growth of the output layer norm $\lVert {\bm{W}}_\mathrm{out} \rVert_{\mathrm{RMS} \to \infty}$ vs. horizon, in tokens (a) and number of steps (b). Results are for the proxy model (69M parameters). Each curve is a (learning rate $\eta$, batch size $B$) pair, with $B$ measured in samples: colour encodes batch size and line style encodes learning rate, as described in the legend.
  • ...and 9 more figures

Theorems & Definitions (2)

  • Definition 1: Spectral condition
  • Definition 2: Induced operator norm
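
The figure captions above track the output layer norm $\lVert {\bm{W}}_\mathrm{out} \rVert_{\mathrm{RMS} \to \infty}$. The text of Definition 2 is not reproduced in this listing, so the sketch below assumes the standard induced-norm definition $\lVert W \rVert_{\alpha \to \beta} = \sup_{x \neq 0} \lVert Wx \rVert_\beta / \lVert x \rVert_\alpha$ with $\lVert x \rVert_\mathrm{RMS} = \lVert x \rVert_2 / \sqrt{d}$. Under that assumption the supremum has the closed form $\sqrt{d_\mathrm{in}} \cdot \max_i \lVert \mathrm{row}_i(W) \rVert_2$. The function names and the `(d_out, d_in)` row layout are illustrative choices, not taken from the Disco implementation.

```python
import math

def rms_norm(x):
    """RMS norm of a vector: ||x||_2 / sqrt(len(x))."""
    return math.sqrt(sum(v * v for v in x) / len(x))

def rms_to_inf_norm(W):
    """Induced RMS -> infinity operator norm of W.

    Assumes W is a list of rows with shape (d_out, d_in), so that y = W x.
    Closed form: sqrt(d_in) * (largest Euclidean row norm).
    """
    d_in = len(W[0])
    largest_row_l2 = max(math.sqrt(sum(v * v for v in row)) for row in W)
    return math.sqrt(d_in) * largest_row_l2

# Small hand-checkable matrix: row norms are 5 and 1, d_in = 2.
W = [[3.0, 4.0], [1.0, 0.0]]
norm_value = rms_to_inf_norm(W)

# The supremum is attained by aligning x with the largest row and
# rescaling it to unit RMS norm.
x_star = [math.sqrt(2) * 3.0 / 5.0, math.sqrt(2) * 4.0 / 5.0]
y_star = [sum(w * x for w, x in zip(row, x_star)) for row in W]
attained = max(abs(y) for y in y_star) / rms_norm(x_star)
```

Here `norm_value` and `attained` agree, which is a quick sanity check that the closed form matches the definition for this matrix. Note that Figure 3 quotes the optimal norm as a power of two, so a logged value would likely be compared on a $\log_2$ scale.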