Spectral Condition for $μ$P under Width-Depth Scaling

Chenyu Zheng, Rongzhen Wang, Xinyu Zhang, Chongxuan Li

TL;DR

This work develops a simple and unified spectral framework for $\mu$P under joint width-depth scaling and derives a general recipe for implementing $\mu$P across a broad class of optimizers by mapping the spectral constraints to concrete HP parameterizations.

Abstract

Generative foundation models are increasingly scaled in both width and depth, posing significant challenges for stable feature learning and reliable hyperparameter (HP) transfer across model sizes. While maximal update parameterization ($μ$P) has provided a principled solution to both problems for width scaling, existing extensions to the joint width-depth scaling regime remain fragmented, architecture- and optimizer-specific, and often rely on technically involved theories. In this work, we develop a simple and unified spectral framework for $μ$P under joint width-depth scaling. Considering residual networks of varying block depths, we first introduce a spectral $μ$P condition that precisely characterizes how the norms of weights and their per-step updates should scale with width and depth, unifying previously disparate $μ$P formulations as special cases. Building on this condition, we then derive a general recipe for implementing $μ$P across a broad class of optimizers by mapping the spectral constraints to concrete HP parameterizations. This approach not only recovers existing $μ$P formulations (e.g., for SGD and AdamW) but also naturally extends to a wider range of optimizers. Finally, experiments on GPT-2 style language models demonstrate that the proposed spectral $μ$P condition preserves stable feature learning and enables robust HP transfer under width-depth scaling.

Paper Structure (127 sections, 206 equations, 6 figures, 13 tables)

Figures (6)

  • Figure 1: Feature learning and HP transfer under SP and $\mu$P. We train GPT-2 style Transformer language models with Muon-Kimi and AdamW using SP and the width-depth $\mu$P derived in Tables \ref{tab: muon-kimi mup} and \ref{tab: adamw-mup}. $\mu$P maintains stable feature norms and enables robust HP transfer across both width and depth scaling, while consistently achieving lower loss than SP as the width and depth increase.
  • Figure 2: Feature learning and HP transfer under SP and $\mu$P without LayerNorm. We compare SP and $\mu$P along two dimensions. First, in terms of training stability, SP becomes increasingly prone to loss divergence as depth increases in the absence of LayerNorm, whereas $\mu$P enables stable training. Second, unlike SP, $\mu$P preserves HP transferability at large depths without LayerNorm.
  • Figure 3: Validation of Assumption \ref{ass:extension_1_update} (Weight Update). The ratio $\frac{\Vert{\bm{W}}_l + \Delta{\bm{W}}_l\Vert_\mathrm{R}}{ \Vert{\bm{W}}_l\Vert_\mathrm{R} + \Vert\Delta{\bm{W}}_l\Vert_\mathrm{R}}$ remains close to 1 across depths for the input layer and the residual block layers, showing non-vanishing updates throughout multi-step training.
  • Figure 4: Validation of Assumption \ref{ass:extension_1_update} (Feature Update). The ratio $\frac{\Vert{\bm{h}}_l + \Delta{\bm{h}}_l\Vert_\mathrm{R}}{ \Vert{\bm{h}}_l\Vert_\mathrm{R} + \Vert\Delta{\bm{h}}_l\Vert_\mathrm{R}}$ remains close to 1 across varying depths, showing non-vanishing updates throughout multi-step training.
  • Figure 5: Validation of Assumption \ref{ass:extension_2_stable_act} (Stable Activation). The ratio of post-activation to pre-activation norms $\frac{\Vert \phi({\bm{W}}_l{\bm{h}}_{l-1})\Vert_\mathrm{R}}{\Vert {\bm{W}}_l{\bm{h}}_{l-1}\Vert_\mathrm{R}}$ remains stable across varying depths, confirming that the ReLU activation does not collapse the norm in non-linear networks.
  • ...and 1 more figure
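
The ratios reported in Figures 4 and 5 can be illustrated with a minimal sketch on plain vectors. This listing does not define $\Vert\cdot\Vert_\mathrm{R}$, so the sketch assumes it is the root-mean-square norm for vectors. Both quantities are ratios of norms of same-sized vectors, so any constant rescaling of the norm cancels. The function names and the random test vector are hypothetical and are not taken from the paper.

```python
import math
import random


def rms_norm(v):
    """Root-mean-square norm of a vector (an assumed reading of ||.||_R)."""
    return math.sqrt(sum(x * x for x in v) / len(v))


def update_alignment_ratio(h, delta_h):
    """||h + dh|| / (||h|| + ||dh||).

    By the triangle inequality this lies in [0, 1]; it equals 1 when dh is a
    positive multiple of h and drops toward 0 when dh cancels h.
    """
    combined = [a + b for a, b in zip(h, delta_h)]
    return rms_norm(combined) / (rms_norm(h) + rms_norm(delta_h))


def activation_ratio(pre_activation):
    """||relu(z)|| / ||z||, the post- to pre-activation norm ratio."""
    post = [max(0.0, z) for z in pre_activation]
    return rms_norm(post) / rms_norm(pre_activation)


random.seed(0)
z = [random.gauss(0.0, 1.0) for _ in range(4096)]  # synthetic stand-in for features

aligned = update_alignment_ratio(z, [0.1 * x for x in z])  # update parallel to z
opposed = update_alignment_ratio(z, [-x for x in z])       # update cancels z
relu_ratio = activation_ratio(z)  # near 1/sqrt(2) for zero-mean symmetric inputs

print(aligned, opposed, relu_ratio)
```

In this toy setting, a ratio near 1 means the update adds to the existing vector rather than cancelling it, and the ReLU ratio stays bounded away from 0, which is the kind of behavior the captions describe for the trained networks.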

Theorems & Definitions (4)

  • Claim E.1: Alignment of initial weight matrices
  • proof
  • Claim E.2: Alignment of updates
  • proof