Towards Robust Scaling Laws for Optimizers

Alexandra Volkova, Mher Safaryan, Christoph H. Lampert, Dan Alistarh

TL;DR

This paper investigates how optimizer choice interacts with scaling laws in large language model pretraining. It shows that fitting independent scaling laws per optimizer yields unstable and non-robust parameter estimates, and proposes a unified, shared-exponent scaling law with optimizer-specific rescalings ρ_N and ρ_D to enable stable, interpretable optimizer comparisons. The authors provide theoretical justification via convex-quadratic analysis and demonstrate empirical improvements in fit stability and extrapolation accuracy across AdamW, Muon, Scion, Shampoo, and SOAP on two model families. They also extend the framework to compute-based scaling and interpret optimizer effects in terms of parameter and data efficiency, offering practical guidance for resource planning at scale.

Abstract

The quality of Large Language Model (LLM) pretraining depends on multiple factors, including the compute budget and the choice of optimization algorithm. Empirical scaling laws are widely used to predict loss as model size and training data grow; however, almost all existing studies fix the optimizer (typically AdamW). At the same time, a new generation of optimizers (e.g., Muon, Shampoo, SOAP) promises faster and more stable convergence, but their relationship with model and data scaling is not yet well understood. In this work, we study scaling laws across different optimizers. Empirically, we show that 1) separate Chinchilla-style scaling laws for each optimizer are ill-conditioned and have highly correlated parameters. Instead, 2) we propose a more robust law with shared power-law exponents and optimizer-specific rescaling factors, which enable direct comparison between optimizers. Finally, 3) we provide a theoretical analysis of gradient-based methods for the proxy task of a convex quadratic objective, demonstrating that Chinchilla-style scaling laws emerge naturally as a result of loss decomposition into irreducible, approximation, and optimization errors.
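To make the idea of shared exponents with optimizer-specific rescaling concrete, here is a minimal Python sketch. It assumes a Chinchilla-style form L(N, D) = E + A / (ρ_N N)^α + B / (ρ_D D)^β, in which the rescaling factors multiply the parameter count N and token count D; this exact functional form, the class name SharedLaw, and all numeric constants are illustrative assumptions, not values or code from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SharedLaw:
    """Chinchilla-style law whose constants are shared by all optimizers.

    ASSUMPTION: optimizer effects enter only through the multiplicative
    factors rho_n and rho_d applied to model size and data size.
    """
    E: float      # irreducible loss
    A: float      # coefficient of the model-size term
    alpha: float  # shared model-size exponent
    B: float      # coefficient of the data-size term
    beta: float   # shared data-size exponent

    def loss(self, n_params, n_tokens, rho_n=1.0, rho_d=1.0):
        # rho_n, rho_d = 1.0 corresponds to the reference optimizer.
        return (self.E
                + self.A / (rho_n * n_params) ** self.alpha
                + self.B / (rho_d * n_tokens) ** self.beta)


# Placeholder constants, chosen only for illustration (not fitted results).
law = SharedLaw(E=1.7, A=400.0, alpha=0.34, B=410.0, beta=0.28)

# Reference optimizer on a 200M-parameter model, 4B tokens.
baseline = law.loss(2e8, 4e9)

# Hypothetical optimizer with rho_n = 2 on a 100M-parameter model.
rescaled = law.loss(1e8, 4e9, rho_n=2.0)

print(f"baseline loss: {baseline:.4f}")
print(f"rescaled loss: {rescaled:.4f}")
```

Under this assumed form, ρ_N = 2 reads as "this optimizer behaves like the reference optimizer on a model twice as large", which is one way to interpret the rescaling factors as parameter and data efficiency.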

Paper Structure (26 sections, 3 theorems, 35 equations, 3 figures, 7 tables, 3 algorithms)

Key Result

Theorem 6.3

If the spectral measure associated with eigenvalues $(\lambda_i)_{i\ge1}$ admits spectral dimension $\omega>0$, then the loss scaling has two phases. (Notation: $a_k = \Theta(b_k)$ means the ratio $a_k/b_k$ converges to some positive finite constant as $k\to\infty$.) Phase 1. If $k\lambda_d$ […] Phase 2. Otherwise, if $k\lambda_d>1$, the power law eventually saturates into an exponential rate for […]
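The two regimes can be observed numerically on a toy problem. The sketch below runs exact gradient descent on a diagonal convex quadratic and compares how fast the loss falls when $k\lambda_d$ is far below 1 and when it is above 1. The spectrum $\lambda_i = i^{-2}$, the all-ones initialization, the step size, and the helper name quadratic_loss are assumptions made for illustration; they are not the paper's setup, and the sketch does not reproduce the theorem's exponents or constants.

```python
# Toy problem: f(w) = 0.5 * sum_i lam_i * w_i**2 in d dimensions.
# ASSUMPTIONS: lam_i = i**-2, w_i = 1 at initialization, step size 1/lam_1.
d = 2000
eigenvalues = [i ** -2.0 for i in range(1, d + 1)]
eta = 1.0  # equals 1 / lam_1, so every mode contracts


def quadratic_loss(k):
    """Loss after k gradient-descent steps, in closed form.

    Each coordinate evolves independently: w_i <- (1 - eta * lam_i) * w_i,
    so after k steps w_i = (1 - eta * lam_i) ** k.
    """
    return 0.5 * sum(lam * (1.0 - eta * lam) ** (2 * k) for lam in eigenvalues)


lam_d = eigenvalues[-1]  # smallest eigenvalue, 2.5e-7 here

# Early regime: k * lam_d is far below 1.
k_early = 100
ratio_early = quadratic_loss(k_early) / quadratic_loss(2 * k_early)

# Late regime: k * lam_d = 10, above 1.
k_late = 40_000_000
ratio_late = quadratic_loss(k_late) / quadratic_loss(2 * k_late)

print(f"k * lam_d = {k_early * lam_d:.1e}: L(k) / L(2k) = {ratio_early:.3f}")
print(f"k * lam_d = {k_late * lam_d:.1e}: L(k) / L(2k) = {ratio_late:.3e}")
```

For a power law the ratio L(k) / L(2k) stays roughly constant as k grows, whereas for exponential decay it grows without bound, so a modest ratio in the early regime and a very large one in the late regime is the pattern that a power law saturating into an exponential rate would produce.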

Figures (3)

  • Figure 1: Correlation between estimated hyperparameters $A, \alpha$ and $B, \beta$ for leave-one-out cross-validation.
  • Figure 2: Prediction error (MSE; lower is better) across optimizers for two scaling-law parameterizations: independent per-optimizer Chinchilla fits (“Naive per-optimizer law”) versus our shared-exponent law with optimizer-specific rescaling factors (“Shared-parameter law”). Our approach more than halves the error.
  • Figure 3: Loss as a function of model size for OLMo family models for token-to-parameter ratios 30 and 200. Data points are measured runs, lines are best-fit scaling curves.

Theorems & Definitions (9)

  • Definition 6.1: Spectral truncation
  • Definition 6.2: Spectral dimension
  • Theorem 6.3: Theoretical scaling law
  • Definition 3.1: Spectral dimension
  • Theorem 3.2
  • proof
  • Definition 3.3: Width-$d$ Model
  • Theorem 3.4
  • proof