Hyperparameter Transfer Laws for Non-Recurrent Multi-Path Neural Networks

Shenxi Wu, Haosong Zhang, Xingjian Ma, Shirui Bian, Yichi Zhang, Xi Chen, Wei Lin

TL;DR

The paper tackles the cost of hyperparameter tuning at scale by deriving a depth-aware transfer rule for non-recurrent, multi-path networks. It extends Maximal Update Parametrization ($\mu$P) to a network-wide Arithmetic-Mean $\mu$P (AM-$\mu$P) budget and introduces the notion of effective depth, proving a universal depth decay for the optimal learning rate, $\eta_\star \propto L^{-3/2}$, across CNNs, ResNets, and Transformers. The authors validate the law on CIFAR-10/100 and ImageNet subsets using automated LR searches, finding fitted slopes near $-1.5$ and demonstrating robust zero-shot transfer of LR across depths and widths. The result is a practical, architecture-agnostic rule for rescaling the learning rate when depth changes, which reduces the need to re-tune it for each new model size.

Abstract

Deeper modern architectures are costly to train, making hyperparameter transfer preferable to expensive repeated tuning. Maximal Update Parametrization ($\mu$P) helps explain why many hyperparameters transfer across width. Yet depth scaling is less understood for modern architectures, whose computation graphs contain multiple parallel paths and residual aggregation. To unify various non-recurrent multi-path neural networks such as CNNs, ResNets, and Transformers, we introduce a graph-based notion of effective depth. Under stabilizing initializations and a maximal-update criterion, we show that the optimal learning rate decays with effective depth following a universal $-3/2$ power law. Here, the maximal-update criterion maximizes the typical one-step representation change at initialization without causing instability, and effective depth is the minimal path length from input to output, counting layers and residual additions. Experiments across diverse architectures confirm the predicted slope and enable reliable zero-shot transfer of learning rates across depths and widths, turning depth scaling into a predictable hyperparameter-transfer problem.
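The graph-based definition of effective depth can be illustrated with a short Python sketch. It is not the authors' code: the function names `effective_depth`, `transfer_lr`, and `toy_resnet`, the edge-list encoding of the computation graph, and the two-layer residual branch are all assumptions made for illustration. Each node entered along the shortest input-to-output path counts as one depth unit, so a skip connection lets a whole residual block contribute a single unit (its addition node).

```python
from collections import deque


def effective_depth(edges, source, sink):
    """Minimal path length from source to sink in a directed graph.

    Each node entered along the path counts as one depth unit
    (an assumed encoding: nodes are layers or residual additions).
    """
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, []).append(v)

    # Breadth-first search returns the fewest-hops path length.
    distance = {source: 0}
    queue = deque([source])
    while queue:
        u = queue.popleft()
        if u == sink:
            return distance[u]
        for v in adjacency.get(u, []):
            if v not in distance:
                distance[v] = distance[u] + 1
                queue.append(v)
    raise ValueError("sink is not reachable from source")


def transfer_lr(base_lr, base_depth, target_depth, exponent=-1.5):
    """Rescale a tuned learning rate to a new effective depth.

    Assumes the power law eta ~ L**exponent holds with the same
    prefactor for both networks.
    """
    if base_depth <= 0 or target_depth <= 0:
        raise ValueError("depths must be positive")
    return base_lr * (target_depth / base_depth) ** exponent


def toy_resnet(num_plain, num_blocks):
    """Hypothetical backbone: plain layers, then two-layer residual blocks."""
    edges, prev = [], "input"
    for i in range(num_plain):
        node = f"plain{i}"
        edges.append((prev, node))
        prev = node
    for j in range(num_blocks):
        conv1, conv2, add = f"b{j}_conv1", f"b{j}_conv2", f"b{j}_add"
        # Residual branch (two layers) plus the skip edge into the addition.
        edges += [(prev, conv1), (conv1, conv2), (conv2, add), (prev, add)]
        prev = add
    return edges, "input", prev


if __name__ == "__main__":
    depth = effective_depth(*toy_resnet(num_plain=2, num_blocks=3))
    print("effective depth:", depth)
    # The base learning rate and base depth below are placeholders.
    print("transferred LR:", transfer_lr(1e-3, base_depth=4, target_depth=depth))
```

With 2 plain layers and 3 residual blocks the search takes every skip edge, so the toy graph has effective depth 5 regardless of how many layers sit inside each residual branch.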

Paper Structure (47 sections, 14 theorems, 93 equations, 7 figures, 3 tables)

Key Result

Proposition 1

Consider a depth-$L$ CNN with stride $1$, fan-in initialization (Eq. \ref{eq:fanin-init}), and pointwise activation $\sigma$. Assume the homogeneity conditions stated in Sec. \ref{sec:method:init} (up to padding-induced boundary non-uniformity and finite-width effects). Then there exists a constant … In particular, the leading depth exponent is $-3/2$, and the listed terms only affect the prefactor.
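A short worked consequence of the exponent may help. Assume the law takes the form $\eta_\star(L) = C\,L^{-3/2}$, where $C$ is a stand-in symbol introduced here for a prefactor that does not depend on $L$ (it is not the proposition's own notation). For two depths $L_1$ and $L_2$ of otherwise identical networks, the prefactor cancels:

$$
\frac{\eta_\star(L_2)}{\eta_\star(L_1)} = \frac{C\,L_2^{-3/2}}{C\,L_1^{-3/2}} = \left(\frac{L_2}{L_1}\right)^{-3/2},
\qquad
L_2 = 2L_1 \;\Rightarrow\; \frac{\eta_\star(L_2)}{\eta_\star(L_1)} = 2^{-3/2} \approx 0.354 .
$$

Under this assumption, doubling the depth calls for roughly a 2.8-fold smaller learning rate, and a learning rate tuned at one depth suffices to predict the others.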

Figures (7)

  • Figure 1: Overview of prior work and our results on depth-wise learning-rate scaling. 1 denotes one depth unit, d and D denote width. (a) Prior work mainly analyzes sequential networks or special cases, and Transformers are often discussed only under width scaling. (b) We treat CNNs, ResNets, and Transformers as non-recurrent multi-path networks and obtain a unified depth law for the maximal-update learning rate.
  • Figure 2: Depth convention for residual networks. Depth is defined as the minimal path length. Along the minimal path, each plain layer and each residual block contributes one depth unit. If the backbone has $m$ plain layers and $K$ residual blocks, then the effective depth is $L=m+K$.
  • Figure 3: Depth--LR scaling on CIFAR-10. (a) CNN: grid-searched optima with 95% CIs and weighted global fit ($\hat{\alpha}=-1.339$). (b) ResNet: our AM-$\mu$P theory ($\hat{\alpha}=-1.435$) closely matches empirical data, while PathSum (chen2024principled) shows increasing deviation at larger depths. (c) Transformer: the ViT variants ViT ($\hat{\alpha}=-1.44$), BEiT ($\hat{\alpha}=-1.35$), and CCT ($\hat{\alpha}=-1.45$) all exhibit clear power-law scaling consistent with our theoretical prediction of $-1.5$.
  • Figure 4: Loss landscapes for ViT variants on CIFAR-10. Color encodes depth (purple: shallow, yellow: deep). All three variants show systematic leftward shifts in optimal LR as depth increases, consistent with the predicted depth--LR scaling relationship.
  • Figure 5: Depth--LR scaling curves for ablation configurations. Each subplot shows $\log_{10}\eta^\star$ versus $\log_{10}L$ with the fitted power-law line. Configurations correspond to Table \ref{tab:ablation-all}. All configurations exhibit clear power-law relationships with exponents close to the theoretical prediction of $-1.5$.
  • ...and 2 more figures
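The fitted exponents $\hat{\alpha}$ quoted in the captions are slopes of a line in $\log_{10}\eta^\star$ versus $\log_{10}L$. A minimal Python sketch of such a fit follows. It is an assumption-laden illustration rather than the authors' procedure: the function name `fit_power_law` and the optional per-point weights are hypothetical, and the data are synthetic points generated from an exact $L^{-3/2}$ law, not measurements from the paper.

```python
import math


def fit_power_law(depths, lrs, weights=None):
    """Weighted least-squares fit of log10(lr) = alpha * log10(depth) + c.

    Returns (alpha, c). The weights are optional per-point weights,
    for example inverse variances of the log-LR estimates (an assumption).
    """
    xs = [math.log10(d) for d in depths]
    ys = [math.log10(lr) for lr in lrs]
    ws = list(weights) if weights is not None else [1.0] * len(xs)
    if not (len(xs) == len(ys) == len(ws)) or len(xs) < 2:
        raise ValueError("need at least two points of matching length")

    # Weighted means of the log-transformed coordinates.
    w_sum = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / w_sum
    y_bar = sum(w * y for w, y in zip(ws, ys)) / w_sum

    # Closed-form slope and intercept for a weighted straight-line fit.
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    if sxx == 0:
        raise ValueError("all depths are identical; slope is undefined")
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    alpha = sxy / sxx
    return alpha, y_bar - alpha * x_bar


if __name__ == "__main__":
    # Synthetic optima that follow eta = 0.1 * L**-1.5 exactly.
    depths = [4, 8, 16, 32, 64]
    lrs = [0.1 * d ** -1.5 for d in depths]
    alpha, intercept = fit_power_law(depths, lrs)
    print(f"fitted exponent: {alpha:.3f}, log10 prefactor: {intercept:.3f}")
```

On real sweeps the optima are noisy, so the recovered slope would deviate from $-1.5$ in the way the fitted values in the captions do.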

Theorems & Definitions (29)

  • Proposition: Depthwise LR scale for 1D/2D CNNs
  • Proposition: Depthwise LR scale for residual networks
  • Proposition: Depthwise LR scale for Transformers
  • Lemma 1: Top-layer reduction for CNN overlaps
  • proof
  • Remark: Zero padding and boundary corrections
  • Lemma 2: Second-moment decomposition
  • proof
  • Remark: Top-layer reduction inside the $B$-term
  • Lemma 3: Boundary missing-term bound for zero padding
  • ...and 19 more