Spectral Condition for $μ$P under Width-Depth Scaling

Chenyu Zheng; Rongzhen Wang; Xinyu Zhang; Chongxuan Li

TL;DR

This work develops a simple and unified spectral framework for $\mu$P under joint width-depth scaling and derives a general recipe for implementing $\mu$P across a broad class of optimizers by mapping the spectral constraints to concrete HP parameterizations.

Abstract

Generative foundation models are increasingly scaled in both width and depth, posing significant challenges for stable feature learning and reliable hyperparameter (HP) transfer across model sizes. While maximal update parameterization ($μ$P) has provided a principled solution to both problems for width scaling, existing extensions to the joint width-depth scaling regime remain fragmented, architecture- and optimizer-specific, and often rely on technically involved theories. In this work, we develop a simple and unified spectral framework for $μ$P under joint width-depth scaling. Considering residual networks of varying block depths, we first introduce a spectral $μ$P condition that precisely characterizes how the norms of weights and their per-step updates should scale with width and depth, unifying previously disparate $μ$P formulations as special cases. Building on this condition, we then derive a general recipe for implementing $μ$P across a broad class of optimizers by mapping the spectral constraints to concrete HP parameterizations. This approach not only recovers existing $μ$P formulations (e.g., for SGD and AdamW) but also naturally extends to a wider range of optimizers. Finally, experiments on GPT-2 style language models demonstrate that the proposed spectral $μ$P condition preserves stable feature learning and enables robust HP transfer under width-depth scaling.
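As a rough illustration of what a spectral constraint on weights and updates looks like, the sketch below uses the width-only spectral condition from prior work, in which the spectral norm of a weight matrix and of its per-step update both scale as $\sqrt{n_\text{out}/n_\text{in}}$. The depth-dependent factors introduced by this paper are not stated on this page, so they are deliberately left out. The helper names (`spectral_norm`, `target_spectral_scale`, `init_with_spectral_norm`, `rescale_update`) and the `lr` argument are hypothetical and chosen only for this example.

```python
import numpy as np


def spectral_norm(w):
    """Largest singular value of a 2-D array."""
    return float(np.linalg.norm(w, ord=2))


def target_spectral_scale(n_in, n_out):
    """Width-only spectral target sqrt(n_out / n_in) (assumption: no depth factor)."""
    return float(np.sqrt(n_out / n_in))


def init_with_spectral_norm(rng, n_in, n_out):
    """Gaussian matrix of shape (n_out, n_in), rescaled to hit the target spectral norm."""
    w = rng.standard_normal((n_out, n_in))
    return w * (target_spectral_scale(n_in, n_out) / spectral_norm(w))


def rescale_update(delta, n_in, n_out, lr=1.0):
    """Rescale a raw update so its spectral norm equals lr * sqrt(n_out / n_in)."""
    return delta * (lr * target_spectral_scale(n_in, n_out) / spectral_norm(delta))


rng = np.random.default_rng(0)
for width in (64, 128, 256):
    w = init_with_spectral_norm(rng, width, width)
    raw_update = rng.standard_normal((width, width))
    dw = rescale_update(raw_update, width, width, lr=0.1)
    # Both norms stay fixed as the width grows, which is the point of the condition.
    print(width, spectral_norm(w), spectral_norm(dw))
```

Normalizing by the measured spectral norm is used here only to make the constraint explicit; practical parameterizations instead choose initialization variances and learning rates so that the same scaling holds without computing singular values at every step.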

Paper Structure (127 sections, 206 equations, 6 figures, 13 tables)

Introduction
Preliminaries
Mathematical Notations and Properties
Spectral Condition for $\mu$P under Width Scaling
Theoretical setup.
$\mu$P principle and its spectral condition.
Current Limitation.
Spectral Condition for $\mu$P under Width-Depth Scaling
Problem Setup
Spectral Scaling Condition
Theoretical Derivation
Preliminary Initial Condition
Input layer.
Hidden layers.
Output layer.
...and 112 more sections

Figures (6)

Figure 1: Feature learning and HP transfer under SP and $\mu$P. We train GPT-2 style Transformer language models with Muon-Kimi and AdamW using SP and the width-depth $\mu$P derived in Tables \ref{tab: muon-kimi mup} and \ref{tab: adamw-mup}. $\mu$P maintains stable feature norms and enables robust HP transfer across both width and depth scaling, while consistently achieving lower loss than SP as the width and depth increase.
Figure 2: Feature learning and HP transfer under SP and $\mu$P without LayerNorm. We compare SP and $\mu$P along two dimensions. First, in terms of training stability, SP becomes increasingly prone to loss divergence as depth increases in the absence of LayerNorm, whereas $\mu$P enables stable training. Second, unlike SP, $\mu$P preserves HP transferability at large depths without LayerNorm.
Figure 3: Validation of Assumption \ref{ass:extension_1_update} (Weight Update). The ratio $\frac{\Vert{\bm{W}}_l + \Delta{\bm{W}}_l\Vert_\mathrm{R}}{ \Vert{\bm{W}}_l\Vert_\mathrm{R} + \Vert\Delta{\bm{W}}_l\Vert_\mathrm{R}}$ stays close to 1 across depths for the input layer and the residual block layers, showing non-vanishing updates throughout multi-step training.
Figure 4: Validation of Assumption \ref{ass:extension_1_update} (Feature Update). The ratio $\frac{\Vert{\bm{h}}_l + \Delta{\bm{h}}_l\Vert_\mathrm{R}}{ \Vert{\bm{h}}_l\Vert_\mathrm{R} + \Vert\Delta{\bm{h}}_l\Vert_\mathrm{R}}$ stays close to 1 across varying depths, showing non-vanishing updates throughout multi-step training.
Figure 5: Validation of Assumption \ref{ass:extension_2_stable_act} (Stable Activation). The ratio of post-activation to pre-activation norms $\frac{\Vert \phi({\bm{W}}_l{\bm{h}}_{l-1})\Vert_\mathrm{R}}{\Vert {\bm{W}}_l{\bm{h}}_{l-1}\Vert_\mathrm{R}}$ remains stable across varying depths, confirming that the ReLU activation does not collapse the norm in non-linear networks.
...and 1 more figures
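
The ratio reported in Figures 3 and 4 can be computed for any tensor and its update. The sketch below is a minimal illustration, not the authors' evaluation code: the norm $\Vert\cdot\Vert_\mathrm{R}$ is not defined on this page, so the Frobenius (Euclidean) norm is used as a stand-in assumption, and the function name `alignment_ratio` is hypothetical. By the triangle inequality the ratio is at most 1, and it equals 1 when the update points in the same direction as the tensor.

```python
import numpy as np


def alignment_ratio(x, dx):
    """||x + dx|| / (||x|| + ||dx||), with the Frobenius norm as a stand-in for ||.||_R."""
    return float(np.linalg.norm(x + dx) / (np.linalg.norm(x) + np.linalg.norm(dx)))


rng = np.random.default_rng(0)
w = rng.standard_normal((32, 32))

aligned = alignment_ratio(w, 0.1 * w)    # update parallel to w: ratio is 1
opposed = alignment_ratio(w, -0.1 * w)   # update anti-parallel to w: ratio is 0.9 / 1.1
random_dir = alignment_ratio(w, 0.1 * rng.standard_normal((32, 32)))  # unrelated direction

print(aligned, opposed, random_dir)
```

The same function applies unchanged to feature vectors, since `np.linalg.norm` falls back to the Euclidean norm for 1-D arrays.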

Theorems & Definitions (4)

Claim E.1: Alignment of initial weight matrices
proof
Claim E.2: Alignment of updates
proof
