Arithmetic-Mean $μ$P for Modern Architectures: A Unified Learning-Rate Scale for CNNs and ResNets

Haosong Zhang, Shenxi Wu, Yichi Zhang, Xi Chen, Wei Lin

TL;DR

The authors address the challenge of selecting learning rates for deep CNNs and residual networks by extending the maximal-update μP framework to a network-wide budget, AM‑μP, which constrains the average one-step pre-activation update. Paired with a residual-aware He initialization, this approach yields width-robust depth laws and a unified depth–LR relationship: η⋆(L) ∝ L^(-3/2) for CNNs/MLPs and η⋆(L)=Θ(L^(-3/2)) for ResNets, with only subleading boundary or width corrections. The theory is supported by extensive experiments on CIFAR-10/100 and ImageNet, showing stable −3/2 scaling and successful zero-shot LR transfer across depths and architectures. Practically, AM‑μP provides a plug-and-play LR principle that reduces tuning overhead and improves reproducibility for large-scale training of convolutional and residual architectures, and lays groundwork for extending to other architectures like Transformers. Overall, the work unifies LR settings across depth, width, and architectural variations, offering a principled default for initial learning-rate choice in modern deep networks.
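
As a minimal sketch of what zero-shot LR transfer under the stated $-3/2$ law could look like in practice: tune the learning rate once at a reference depth, then rescale it for other depths. The function name `transfer_lr`, the reference depth, and the reference learning rate below are illustrative assumptions, not values or code from the paper.

```python
def transfer_lr(eta_ref: float, depth_ref: int, depth_target: int,
                exponent: float = -1.5) -> float:
    """Rescale a learning rate tuned at depth_ref to depth_target.

    Assumes eta*(L) is proportional to L**exponent, with exponent = -3/2
    as stated in the summary; the proportionality constant cancels out.
    """
    if depth_ref <= 0 or depth_target <= 0:
        raise ValueError("depths must be positive")
    return eta_ref * (depth_target / depth_ref) ** exponent


# Hypothetical anchor: LR 0.1 found by grid search at depth 8.
eta_at_8 = 0.1
for depth in (8, 16, 32):
    print(depth, transfer_lr(eta_at_8, depth_ref=8, depth_target=depth))
```

Because only the ratio of depths enters, the sketch needs a single tuned anchor; depth here would follow the minimal-path convention the paper uses.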

Abstract

Choosing an appropriate learning rate remains a key challenge when scaling the depth of modern deep networks. The classical maximal update parameterization ($\mu$P) enforces a fixed per-layer update magnitude, which is well suited to homogeneous multilayer perceptrons (MLPs) but becomes ill-posed in heterogeneous architectures where residual accumulation and convolutions introduce imbalance across layers. We introduce Arithmetic-Mean $\mu$P (AM-$\mu$P), which constrains not each individual layer but the network-wide average one-step pre-activation second moment to a constant scale. Combined with a residual-aware He fan-in initialization, which scales residual-branch weights by the number of blocks ($\mathrm{Var}[W]=c/(K\cdot \mathrm{fan\text{-}in})$), AM-$\mu$P yields width-robust depth laws that transfer consistently across depths. We prove that, for one- and two-dimensional convolutional networks, the maximal-update learning rate satisfies $\eta^\star(L)\propto L^{-3/2}$; with zero padding, boundary effects are constant-level as $N\gg k$. For standard residual networks with general conv+MLP blocks, we establish $\eta^\star(L)=\Theta(L^{-3/2})$, with $L$ the minimal depth. Empirical results across a range of depths confirm the $-3/2$ scaling law and enable zero-shot learning-rate transfer, providing a unified and practical LR principle for convolutional and deep residual networks without additional tuning overhead.
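
The residual-aware initialization $\mathrm{Var}[W]=c/(K\cdot \mathrm{fan\text{-}in})$ can be sketched as follows. This is an illustration under stated assumptions rather than the authors' implementation: the constant $c=2$ (the usual He value for ReLU), the Gaussian sampling, the helper names, and the layer sizes are all assumed here.

```python
import math
import random
import statistics


def residual_he_std(fan_in: int, num_blocks: int, c: float = 2.0) -> float:
    """Standard deviation for Var[W] = c / (K * fan_in), with K = num_blocks."""
    if fan_in <= 0 or num_blocks <= 0:
        raise ValueError("fan_in and num_blocks must be positive")
    return math.sqrt(c / (num_blocks * fan_in))


def sample_conv_weights(out_ch: int, in_ch: int, kernel: int,
                        num_blocks: int, rng: random.Random,
                        c: float = 2.0) -> list:
    """Draw a flat list of Gaussian weights for a kernel x kernel conv layer."""
    fan_in = in_ch * kernel * kernel  # inputs feeding each output unit
    std = residual_he_std(fan_in, num_blocks, c)
    return [rng.gauss(0.0, std) for _ in range(out_ch * fan_in)]


# Hypothetical residual branch: 64 -> 64 channels, 3x3 kernel, K = 16 blocks.
rng = random.Random(0)
weights = sample_conv_weights(64, 64, 3, num_blocks=16, rng=rng)
target_var = 2.0 / (16 * 64 * 3 * 3)
print(statistics.pvariance(weights), target_var)
```

With $K=1$ this reduces to plain He fan-in initialization; larger $K$ shrinks each residual branch so that the accumulated contribution across blocks stays at a comparable scale.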

Paper Structure

This paper contains 65 sections, 12 theorems, 91 equations, 10 figures.

Key Result

Theorem 1

Let the spatial dimension be $d\in\{1,2\}$ and consider a homogeneous convolutional block with stride $=1$, circular padding, ReLU activation, and He fan-in initialization. For arbitrary channel widths $\{C_\ell\}$, arbitrary kernel supports $\mathcal{K}_\ell\subset\mathbb{Z}^d$ (of any size/shape), where $\kappa$ depends only on the activation/initialization fixed point and is independent of $\{C

Figures (10)

  • Figure 1: Depth convention (minimal-path). Depth equals the minimal path length; each residual block counts as 1.
  • Figure 2: Global depth–LR scaling on CIFAR-10. (a) CNN: grid-searched optima with 95% CIs and a weighted global fit. (b) ResNet: global fit alongside AM-$\mu$P theory and PathSum (Chen2024).
  • Figure 3: CNN on CIFAR-10 (GELU): full panel. Top-left: segmented predictions using two anchor depths per segment (A/B/C). Top-right: global power-law fit of $\eta^\star$ vs. $L$ with slope $\hat{\alpha} \approx -1.38$ (red dashed), shown against the $L^{-3/2}$ reference line (green dash-dotted). Bottom-left: relative errors by segment, with larger deviations near segment boundaries and at the largest depths. Bottom-right: linear-scale view showing the rapid decay of the maximal-update learning rate $\eta^\star$ with depth.
  • Figure 4: CNN padding comparison on CIFAR-10 (ReLU).
  • Figure 5: CNN on CIFAR-100 (ReLU): full panel. Top-left: segmented predictions using two anchor depths per segment (A/B/C). Top-right: global power-law fit of $\eta^\star$ vs. $L$ with slope $\hat{\alpha} \approx -1.392$ (red dashed), closely tracking the $L^{-3/2}$ reference (green dash-dotted). Bottom-left: relative errors by segment, with larger deviations near segment boundaries and at the largest depths. Bottom-right: linear-scale view showing the rapid decay of the maximal-update learning rate $\eta^\star$ with depth.
  • ...and 5 more figures
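
The global power-law fits reported in the captions (slopes $\hat{\alpha}$ near $-3/2$) amount to a linear regression in log-log space. The sketch below uses an unweighted ordinary least-squares fit on synthetic data; the paper's Figure 2 mentions a weighted fit, so this is a simplified stand-in, and the depths, prefactor, and function name are assumptions made for illustration.

```python
import math


def fit_power_law(depths, lrs):
    """Fit lr ~ prefactor * depth**alpha by least squares on (log depth, log lr).

    Returns (alpha, prefactor).
    """
    xs = [math.log(d) for d in depths]
    ys = [math.log(lr) for lr in lrs]
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    alpha = sxy / sxx                      # slope in log-log space
    prefactor = math.exp(y_mean - alpha * x_mean)
    return alpha, prefactor


# Synthetic optima that follow an exact L^{-3/2} law (not measured data).
depths = [4, 8, 16, 32, 64]
lrs = [0.5 * d ** -1.5 for d in depths]
alpha_hat, prefactor = fit_power_law(depths, lrs)
print(alpha_hat, prefactor)
```

On real grid-searched optima the fitted slope would deviate from $-3/2$, as in the captions above; weighting each point by the inverse variance of its estimate is one common refinement.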

Theorems & Definitions (18)

  • Theorem 1: Width-invariant depth scaling for homogeneous conv blocks in 1D/2D
  • Theorem 2: Finite-width, boundary, and mini-batch corrections in 1D/2D
  • Theorem 3: $\mu$P scaling law for ResNets
  • Lemma 1: Layerwise conditional expectation invariance (1D CNN, stride $=1$)
  • proof
  • Corollary: Layerwise invariance in expectation
  • Remark
  • Lemma 2: Second-moment decomposition of pre-activation changes in CNNs (top-layer form)
  • Corollary: Top-layer reduction via layerwise invariance
  • Remark
  • ...and 8 more