Arithmetic-Mean $μ$P for Modern Architectures: A Unified Learning-Rate Scale for CNNs and ResNets
Haosong Zhang, Shenxi Wu, Yichi Zhang, Xi Chen, Wei Lin
TL;DR
The authors address the challenge of selecting learning rates for deep CNNs and residual networks by extending the maximal update parameterization (μP) to a network-wide budget, AM‑μP, which constrains the average one-step pre-activation update across the network rather than the update of each individual layer. Paired with a residual-aware He initialization, this approach yields width-robust depth laws and a unified depth–LR relationship: η⋆(L) ∝ L^(-3/2) for CNNs/MLPs and η⋆(L)=Θ(L^(-3/2)) for ResNets, with only subleading boundary or width corrections. The theory is supported by extensive experiments on CIFAR-10/100 and ImageNet, which show stable −3/2 scaling and successful zero-shot LR transfer across depths and architectures. In practice, AM‑μP provides a plug-and-play LR principle that reduces tuning overhead and improves reproducibility for large-scale training of convolutional and residual architectures, and it lays the groundwork for extensions to other architectures such as Transformers. Overall, the work unifies LR settings across depth, width, and architectural variations, offering a principled default for the initial learning rate in modern deep networks.
Abstract
Choosing an appropriate learning rate remains a key challenge in scaling the depth of modern deep networks. The classical maximal update parameterization ($μ$P) enforces a fixed per-layer update magnitude, which is well suited to homogeneous multilayer perceptrons (MLPs) but becomes ill-posed in heterogeneous architectures, where residual accumulation and convolutions introduce imbalance across layers. We introduce Arithmetic-Mean $μ$P (AM-$μ$P), which constrains the network-wide average one-step pre-activation second moment, rather than that of each individual layer, to a constant scale. Combined with a residual-aware He fan-in initialization that scales the variance of residual-branch weights down by the number of blocks $K$ ($\mathrm{Var}[W]=c/(K\cdot \mathrm{fan\text{-}in})$), AM-$μ$P yields width-robust depth laws that transfer consistently across depths. We prove that, for one- and two-dimensional convolutional networks, the maximal-update learning rate satisfies $η^\star(L)\propto L^{-3/2}$; with zero padding, boundary effects remain constant-level when $N\gg k$. For standard residual networks with general conv+MLP blocks, we establish $η^\star(L)=Θ(L^{-3/2})$, with $L$ the minimal depth. Empirical results across a range of depths confirm the $-3/2$ scaling law and enable zero-shot learning-rate transfer, providing a unified and practical LR principle for convolutional and deep residual networks without additional tuning overhead.
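As a rough illustration of how the two ingredients above might be used in practice, the following minimal sketch rescales a learning rate tuned at one depth to another depth via $η^\star(L)\propto L^{-3/2}$, and computes the standard deviation implied by $\mathrm{Var}[W]=c/(K\cdot \mathrm{fan\text{-}in})$. The function names `transfer_lr` and `residual_he_std` are hypothetical, and the default $c=2$ (the usual He constant for ReLU) is an assumption, not a value stated here; the sketch is not the authors' implementation.

```python
import math


def transfer_lr(base_lr: float, base_depth: int, target_depth: int) -> float:
    """Rescale a tuned learning rate to a new depth, assuming eta*(L) ∝ L^(-3/2).

    Hypothetical helper: base_lr is the rate tuned at base_depth.
    """
    if base_depth <= 0 or target_depth <= 0:
        raise ValueError("depths must be positive")
    return base_lr * (target_depth / base_depth) ** -1.5


def residual_he_std(fan_in: int, num_blocks: int, c: float = 2.0) -> float:
    """Std of residual-branch weights with Var[W] = c / (K * fan_in).

    c = 2.0 is an assumed default (the standard He constant for ReLU).
    """
    if fan_in <= 0 or num_blocks <= 0:
        raise ValueError("fan_in and num_blocks must be positive")
    return math.sqrt(c / (num_blocks * fan_in))


if __name__ == "__main__":
    # Learning rate tuned at depth 8, transferred zero-shot to depth 32.
    print(transfer_lr(0.1, base_depth=8, target_depth=32))
    # Residual-branch init std for fan-in 64 in a network with 8 blocks.
    print(residual_he_std(fan_in=64, num_blocks=8))
```

Under this rule, quadrupling the depth divides the learning rate by $4^{3/2}=8$, so only a single reference depth needs to be tuned.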
