Deriving Hyperparameter Scaling Laws via Modern Optimization Theory

Egor Shulgin, Dimitri von Rütte, Tianyue H. Zhang, Niccolò Ajroldi, Bernhard Schölkopf, Antonio Orvieto

Abstract

Hyperparameter transfer has become an important component of modern large-scale training recipes. Existing methods, such as muP, primarily focus on transfer between model sizes, with transfer across batch sizes and training horizons often relying on empirical scaling rules informed by insights from timescale preservation, quadratic proxies, and continuous-time approximations. We study hyperparameter scaling laws for modern first-order optimizers through the lens of recent convergence bounds for methods based on the Linear Minimization Oracle (LMO), a framework that includes normalized SGD, signSGD (approximating Adam), and Muon. Treating bounds in recent literature as a proxy and minimizing them across different tuning regimes yields closed-form power-law schedules for learning rate, momentum, and batch size as functions of the iteration or token budget. Our analysis, holding model size fixed, recovers most insights and observations from the literature under a unified and principled perspective, with clear directions open for future research. Our results draw particular attention to the interaction between momentum and batch-size scaling, suggesting that optimal performance may be achieved with several scaling strategies.
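
As a rough illustration of the LMO framework named in the abstract, the sketch below computes the linear minimization oracle over three unit norm balls. The $\ell_2$ ball gives a normalized-SGD direction, the $\ell_\infty$ ball gives a signSGD direction, and the spectral-norm ball gives the orthogonalized direction associated with Muon. The function names, the exact SVD (practical Muon implementations typically use an iterative approximation instead), and the exponential-moving-average form of the momentum buffer are assumptions made for illustration; they are not taken from the paper.

```python
import numpy as np

def lmo_l2(g):
    """LMO over the unit l2 (Frobenius) ball: argmin_{||d||<=1} <g, d>."""
    norm = np.linalg.norm(g)
    return -g / norm if norm > 0 else np.zeros_like(g)

def lmo_linf(g):
    """LMO over the unit l-infinity ball: the signSGD direction."""
    return -np.sign(g)

def lmo_spectral(g):
    """LMO over the unit spectral-norm ball for a matrix g.

    Uses an exact SVD g = U S V^T and returns -U V^T.
    For rank-deficient g the minimizer is not unique.
    """
    u, _, vt = np.linalg.svd(g, full_matrices=False)
    return -(u @ vt)

def lmo_momentum_step(w, m, g, eta, beta, lmo):
    """One step with a momentum buffer (assumed EMA form, beta = 1 - alpha)."""
    m = beta * m + (1.0 - beta) * g   # hypothetical momentum update
    return w + eta * lmo(m), m        # lmo already returns a descent direction

if __name__ == "__main__":
    g = np.array([[3.0, 0.0], [0.0, -4.0]])
    w, m = np.zeros((2, 2)), np.zeros((2, 2))
    w, m = lmo_momentum_step(w, m, g, eta=0.1, beta=0.9, lmo=lmo_spectral)
    print(w)
```

Swapping the `lmo` argument changes the optimizer family while the step size `eta`, momentum `beta`, and batch size (which enters through the noise in `g`) remain the hyperparameters whose scaling the paper studies.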

Paper Structure (60 sections, 5 theorems, 89 equations, 11 figures, 3 tables)

Key Result

Theorem 1

Fix $\alpha\in(0,1]$ and consider equation \eqref{eq:perf-fixed-alpha}.

Figures (11)

  • Figure 1: Verification of Theorem \ref{thm:fixed-alpha}. Shown are the trends of equation \eqref{eq:perfT-def} ($C_1=C_2=C_3=1$) under the choice $\beta=1-\alpha=0.999$. Results for additional values of $\alpha$ can be found in App. \ref{sec:additional_exp}. In the two plots on the left, magenta shows the performance at the best value of $(\eta,b)$ for each token budget, which follows $\mathcal{O}(T^{-1/4})$. Blue shows the performance for a fixed batch size, minimizing over $\eta$ at each token budget, and green shows the performance for a fixed learning rate, minimizing over $b$ at each token budget. The two plots on the right show how the optimal batch size and learning rate scale with tokens. The trends predicted by Theorem \ref{thm:fixed-alpha} hold after a burn-in phase, during which the optimal batch size is $b=1$.
  • Figure 2: Numerical verification of Theorem \ref{thm:fixed_batch_mom} for $b=1072$. The setting is the same as in Figure \ref{fig:thm1}.
  • Figure 3: Qualitative empirical support for our analysis. We perform a constant-$\eta$ training experiment on a $160$M transformer (PlainLM implementation \cite{ajroldi2024plainlm}), trained with a language modeling objective for up to $5$B tokens from SlimPajama \cite{soboleva2023slimpajama}. The setup is the same as in Theorem \ref{thm:fixed-alpha} and Figure \ref{fig:thm1}. With $8\%$ warmup followed by a constant learning rate (Adam betas fixed to $(0.9, 0.95)$), grid-searching $\eta$ across batch sizes shows clear structure: (i) at fixed batch size, the optimal $\eta$ decreases over training time (e.g., for $b=32$ it shifts from $0.003$ to $0.001$ after $2.5$B tokens); (ii) the same trend appears for $b=128$, although the switching point is not yet reached at $5$B tokens; and (iii) the optimal $\eta$ increases with batch size (at $T=5$B tokens, $\eta=0.001$ at $b=32$, $\eta=0.003$ at $b=128$, $\eta=0.01$ at $b=512$). A constant $\eta$ allows clean finite-time comparisons that reveal both the token dependence and the batch dependence of the optimal $\eta$.
  • Figure 4: Contours of best achievable performance versus batch size and training iterations for a fixed $\alpha$ and tuned $\eta$.
  • Figure 5: Numerical verification of Theorem \ref{thm:fixed-alpha} for $\beta=0.934$.
  • ...and 6 more figures
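
The three (batch size, best learning rate) pairs reported in the Figure 3 caption at $T=5$B tokens can be turned into a rough power-law exponent with a log-log least-squares fit. The sketch below is only a back-of-the-envelope check: the helper name `loglog_slope` is hypothetical, the fit uses just the three values quoted in the caption, and those values sit on a coarse grid, so the resulting exponent is indicative rather than precise.

```python
import math

# Values quoted in the Figure 3 caption at T = 5B tokens.
batch_sizes = [32, 128, 512]
best_lrs = [0.001, 0.003, 0.01]

def loglog_slope(xs, ys):
    """Least-squares slope of log(y) against log(x) (hypothetical helper)."""
    lx = [math.log(x) for x in xs]
    ly = [math.log(y) for y in ys]
    mean_x = sum(lx) / len(lx)
    mean_y = sum(ly) / len(ly)
    num = sum((a - mean_x) * (b - mean_y) for a, b in zip(lx, ly))
    den = sum((a - mean_x) ** 2 for a in lx)
    return num / den

exponent = loglog_slope(batch_sizes, best_lrs)
print(f"fitted exponent in eta ~ b^p: p = {exponent:.2f}")  # about 0.83
```

Because the batch sizes are spaced by a constant factor of 4, the fit reduces to $\log(0.01/0.001)/\log(512/32) = \log 10/\log 16 \approx 0.83$, which lies between square-root ($p=0.5$) and linear ($p=1$) learning-rate scaling with batch size.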

Theorems & Definitions (6)

  • Theorem 1: Fixed momentum, large-horizon proxy
  • Theorem 2: Fixed batch size, large-horizon proxy
  • Theorem 3: Jointly tuned $\boldsymbol{(\eta,\alpha,b)}$ under a fixed token budget
  • Corollary 1: Momentum tuning and Batch size constraints
  • Corollary 2: Several batch size scalings are near-optimal
  • Remark 4: Continuum of feasible batch-growth paths under Theorem 2