Rethinking Language Model Scaling under Transferable Hypersphere Optimization

Liliang Ren, Yang Liu, Yelong Shen, Weizhu Chen

Abstract

Scaling laws for large language models depend critically on the optimizer and parameterization. Existing hyperparameter transfer laws are mainly developed for first-order optimizers, and they do not structurally prevent training instability at scale. Recent hypersphere optimization methods constrain weight matrices to a fixed-norm hypersphere, offering a promising alternative for more stable scaling. We introduce HyperP (Hypersphere Parameterization), the first framework for transferring optimal learning rates across model width, depth, training tokens, and Mixture-of-Experts (MoE) granularity under the Frobenius-sphere constraint with the Muon optimizer. We prove that weight decay is a first-order no-op on the Frobenius sphere, show that Depth-$\mu$P remains necessary, and find that the optimal learning rate follows the same data-scaling power law with the "magic exponent" 0.32 previously observed for AdamW. A single base learning rate tuned at the smallest scale transfers across all compute budgets under HyperP, yielding $1.58\times$ compute efficiency over a strong Muon baseline at $6\times10^{21}$ FLOPs. Moreover, HyperP delivers transferable stability: all monitored instability indicators, including $Z$-values, output RMS, and activation outliers, remain bounded and non-increasing under training FLOPs scaling. We also propose SqrtGate, an MoE gating mechanism derived from the hypersphere constraint that preserves output RMS across MoE granularities for improved granularity scaling, and show that hypersphere optimization enables substantially larger auxiliary load-balancing weights, yielding both strong performance and good expert balance. We release our training codebase at https://github.com/microsoft/ArchScale.

Paper Structure

This paper contains 42 sections, 8 theorems, 51 equations, 15 figures, 16 tables.

Key Result

Theorem 1 (First-order form of Frobenius-sphere updates)

Let $W\in\mathbb{R}^{d_{\mathrm{out}}\times d_{\mathrm{in}}}$ satisfy $\|W\|_F=c_W$, and define the Frobenius-sphere retraction $\mathcal{R}_W(\Delta) = c_W\,\frac{W+\Delta}{\|W+\Delta\|_F}$. Then, for sufficiently small $\|\Delta\|_F$, $\mathcal{R}_W(\Delta) = W + \Pi_T(\Delta) + O(\|\Delta\|_F^2)$, where $\Pi_T(\Delta) = \Delta - \frac{\langle \Delta, W\rangle_F}{\|W\|_F^2}W$ is the tangent-space projection at $W$.
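Since the weight-decay direction $-\eta\lambda W$ is radial, $\Pi_T(-\eta\lambda W)=0$ exactly, which is Corollary 1.1. Both facts are easy to check numerically. Below is a minimal PyTorch sketch (not from the paper's released codebase; shapes and step sizes are arbitrary illustrative choices) that verifies the retraction error shrinks quadratically in $\|\Delta\|_F$ and that the projected weight-decay step vanishes.

```python
import torch

torch.manual_seed(0)

d_out, d_in = 64, 128                          # arbitrary illustrative shapes
W = torch.randn(d_out, d_in, dtype=torch.float64)
c_W = W.norm()                                 # W lies on the Frobenius sphere of radius c_W

def retract(W, delta, c):
    """Map W + delta back onto the Frobenius sphere of radius c."""
    V = W + delta
    return c * V / V.norm()

def tangent_project(delta, W):
    """Pi_T(delta) = delta - (<delta, W>_F / ||W||_F^2) W."""
    return delta - ((delta * W).sum() / W.norm() ** 2) * W

# Theorem 1: retract(W, delta) = W + Pi_T(delta) + O(||delta||_F^2),
# so shrinking ||delta||_F by 10x should shrink the error by ~100x.
for eps in (1e-1, 1e-2, 1e-3):
    delta = eps * torch.randn(d_out, d_in, dtype=torch.float64)
    err = (retract(W, delta, c_W) - (W + tangent_project(delta, W))).norm()
    print(f"||delta||_F = {delta.norm():.2e}  first-order error = {err:.2e}")

# Corollary 1.1: a weight-decay step -lr*lam*W is radial, so it projects to zero.
lr, lam = 1e-2, 0.1
print("projected weight-decay norm:", tangent_project(-lr * lam * W, W).norm().item())
```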

Figures (15)

  • Figure 1: Left: Loss vs. LR at different token budgets. Right: Fitted optimal LR vs. training tokens on log-log scale, showing a clean power-law relationship with exponent $0.32$. The exact values are reported in \ref{tab:data-scaling}; a fitting sketch follows this list.
  • Figure 2: Validation loss vs. learning rate for Muon (sweeping weight decay $\lambda$) and MuonH ($\lambda{=}0$). MuonH achieves comparable optimality with a simpler hyperparameter space.
  • Figure 3: Loss vs. LR curves across model sizes with Depth-$\mu$P (left) and without Depth-$\mu$P (right). Depth-$\mu$P keeps the optimal LR stable at $\eta^* \approx 0.014$ -- $0.016$ across all depths, while the optimum drifts from $\eta^* = 0.016$ at $d{=}8$ to $\eta^* = 0.008$ at $d{=}24$ without Depth-$\mu$P.
  • Figure 4: Left: Loss vs. LR at different batch sizes. Right: Optimal LR vs. batch size on log-log scale. The exact values are reported in \ref{tab:bsz-scaling}.
  • Figure 5: Loss vs. LR curves for three auxiliary loss weights. The curves nearly overlap, indicating robustness to $\gamma$ under hypersphere optimization. The exact values across all LR and $\gamma$ combinations are reported in \ref{tab:auxloss}.
  • ...and 10 more figures
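To make the Figure 1 (right) relationship concrete: on a log-log scale a power law $\eta^* = a\,D^{b}$ is a straight line whose slope is the exponent $b$, so it can be recovered with a one-line linear fit. The sketch below uses synthetic (token budget, optimal LR) pairs with a built-in exponent of $-0.32$ purely for illustration; the sign, prefactor, and token budgets are assumptions, and the paper's actual measurements live in \ref{tab:data-scaling}.

```python
import numpy as np

# Hypothetical (training tokens, optimal LR) pairs, generated with a built-in
# exponent of -0.32; the paper's real values are in tab:data-scaling.
tokens = np.array([1e9, 4e9, 16e9, 64e9])
opt_lr = 0.05 * (tokens / 1e9) ** (-0.32)

# Fit log(eta*) = b * log(D) + log(a): the slope b is the power-law exponent.
b, log_a = np.polyfit(np.log(tokens), np.log(opt_lr), 1)
print(f"fitted exponent b = {b:.2f}")   # recovers -0.32 on this synthetic data
```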

Theorems & Definitions (12)

  • Theorem 1: First-order form of Frobenius-sphere updates
  • Corollary 1.1: Weight decay is a first-order no-op
  • Theorem 2: Width transfer under Frobenius sphere
  • Theorem 3: Depth scaling under Frobenius-sphere optimization
  • Proposition 4: Bounded Logits under Hypersphere Constraint
  • Proposition 5: Classical gating is $k$-dependent
  • Proposition 6: SqrtGate is approximately $k$-invariant (a hypothetical gating sketch follows this list)
  • Lemma 7: Spectral--Frobenius sandwich
  • Proof
  • Proof of Theorem \ref{thm:width-scaling-hyperball}
  • ...and 2 more
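This page does not show SqrtGate's formula, so the sketch below is one plausible reading consistent with the name and with the abstract's RMS-preservation claim: weight each selected expert by the square root of its renormalized top-$k$ gate, so the squared combination weights sum to 1 and, for roughly uncorrelated unit-RMS expert outputs, the combined output RMS stays approximately $k$-invariant (Proposition 6), whereas classical gating's output RMS shrinks as $k$ grows (Proposition 5). Treat this as an illustration, not the paper's definition.

```python
import torch

torch.manual_seed(0)

def combine(expert_out, gates, mode):
    """Mix top-k expert outputs for one token.

    expert_out: (k, d) outputs of the k selected experts
    gates:      (k,)  renormalized top-k softmax gates (sum to 1)
    """
    if mode == "classical":
        w = gates            # sum_i w_i = 1, but sum_i w_i^2 shrinks with k
    else:                    # hypothetical "sqrt" reading of SqrtGate
        w = gates.sqrt()     # sum_i w_i^2 = 1 for every k
    return (w[:, None] * expert_out).sum(dim=0)

d = 4096
for k in (1, 2, 4, 8, 16):
    expert_out = torch.randn(k, d)                  # stand-in experts, RMS ~ 1
    gates = torch.softmax(torch.randn(k), dim=0)
    row = f"k={k:2d}"
    for mode in ("classical", "sqrt"):
        rms = combine(expert_out, gates, mode).pow(2).mean().sqrt()
        row += f"  {mode} RMS={rms:.3f}"
    print(row)
# Classical gating's output RMS decays with k; the sqrt variant stays near 1.
```

Under this reading, the square-root rescaling is what decouples the granularity $k$ from the output scale, which is the property "preserves output RMS across MoE granularities" requires; the paper's actual gate may differ in details such as where the renormalization happens.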