Hyperparameter Transfer Laws for Non-Recurrent Multi-Path Neural Networks

Shenxi Wu; Haosong Zhang; Xingjian Ma; Shirui Bian; Yichi Zhang; Xi Chen; Wei Lin

TL;DR

The paper tackles the cost of hyperparameter tuning at scale by deriving a depth-aware transfer rule for non-recurrent, multi-path networks. It extends Maximal Update Parametrization to a network-wide Arithmetic-Mean μP budget and introduces the notion of effective depth, proving a universal depth decay for the learning rate, $\eta_\star \propto L^{-3/2}$, across CNNs, ResNets, and Transformers. The authors validate the law on CIFAR-10/100 and ImageNet subsets using automated LR searches, finding fitted slopes near $-1.5$ and demonstrating robust zero-shot transfer of LR across depths and widths. This work provides a practical, architecture-agnostic guideline for rescaling learning rates when altering depth, substantially reducing hyperparameter-tuning requirements in modern deep learning models.
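As a rough illustration of how the $\eta_\star \propto L^{-3/2}$ rule could be applied in practice, the sketch below rescales a learning rate tuned at one depth to another depth. The function name `transfer_lr`, the base learning rate, and the depths are hypothetical and not taken from the paper; the only ingredient from the source is the $-3/2$ exponent.

```python
def transfer_lr(base_lr: float, base_depth: int, target_depth: int,
                exponent: float = -1.5) -> float:
    """Rescale a tuned learning rate to a new effective depth.

    Assumes the power law eta ∝ L**exponent, with the paper's predicted
    exponent of -3/2 as the default. All names here are illustrative.
    """
    if base_depth <= 0 or target_depth <= 0:
        raise ValueError("depths must be positive")
    return base_lr * (target_depth / base_depth) ** exponent


if __name__ == "__main__":
    base_lr = 3e-3  # hypothetical value, assumed to be tuned at depth 8
    for depth in (8, 16, 32):
        # Doubling the depth multiplies the LR by 2**-1.5 (about 0.354).
        print(depth, transfer_lr(base_lr, base_depth=8, target_depth=depth))
```

Under this rule only one depth needs a full learning-rate search; the other depths reuse the tuned value after rescaling.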

Abstract

Deeper modern architectures are costly to train, making hyperparameter transfer preferable to expensive repeated tuning. Maximal Update Parametrization ($\mu$P) helps explain why many hyperparameters transfer across width. Yet depth scaling is less understood for modern architectures, whose computation graphs contain multiple parallel paths and residual aggregation. To unify various non-recurrent multi-path neural networks such as CNNs, ResNets, and Transformers, we introduce a graph-based notion of effective depth. Under stabilizing initializations and a maximal-update criterion, we show that the optimal learning rate decays with effective depth following a universal $-3/2$ power law. Here, the maximal-update criterion maximizes the typical one-step representation change at initialization without causing instability, and effective depth is the minimal path length from input to output, counting layers and residual additions. Experiments across diverse architectures confirm the predicted slope and enable reliable zero-shot transfer of learning rates across depths and widths, turning depth scaling into a predictable hyperparameter-transfer problem.
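One way to read the definition of effective depth is as a shortest-path problem on the computation graph, where each layer and each residual addition costs one unit and skip connections cost nothing. The sketch below follows that reading; the graph encoding, the node names, and the `effective_depth` helper are assumptions made for illustration, and the paper's formal definition may differ in detail.

```python
import heapq


def effective_depth(edges, counted, source="input", sink="output"):
    """Minimal number of counted nodes on any source-to-sink path.

    edges:   dict mapping a node to the nodes it feeds into.
    counted: set of nodes that contribute one depth unit each
             (here: layers and residual additions).
    """
    def cost(node):
        return 1 if node in counted else 0

    best = {source: cost(source)}
    heap = [(best[source], source)]
    while heap:
        dist, node = heapq.heappop(heap)
        if node == sink:
            return dist
        if dist > best.get(node, float("inf")):
            continue  # stale heap entry
        for nxt in edges.get(node, ()):
            new_dist = dist + cost(nxt)
            if new_dist < best.get(nxt, float("inf")):
                best[nxt] = new_dist
                heapq.heappush(heap, (new_dist, nxt))
    raise ValueError("sink is not reachable from source")


# Hypothetical backbone: 2 plain layers followed by 2 residual blocks,
# each block having a two-layer branch and an identity skip into its addition.
EDGES = {
    "input": ["conv1"],
    "conv1": ["conv2"],
    "conv2": ["b1_l1", "b1_add"],   # branch and skip of block 1
    "b1_l1": ["b1_l2"],
    "b1_l2": ["b1_add"],
    "b1_add": ["b2_l1", "b2_add"],  # branch and skip of block 2
    "b2_l1": ["b2_l2"],
    "b2_l2": ["b2_add"],
    "b2_add": ["output"],
}
COUNTED = set(EDGES) - {"input"}  # every layer and addition counts once

if __name__ == "__main__":
    print(effective_depth(EDGES, COUNTED))
```

In this toy graph the shortest path takes the skip connection through both blocks, so each block contributes only its addition, and the two plain layers contribute one unit each.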

Paper Structure (47 sections, 14 theorems, 93 equations, 7 figures, 3 tables)

Introduction
Main contributions.
Related Works
Methods
Initialization and architectural conventions
Maximal-update parameterizations and the arithmetic-mean criterion
Depthwise learning-rate scaling for CNNs
Depthwise learning-rate scaling for residual networks
Depthwise learning-rate scaling for Transformers
Experiments
General Protocol
Convolutional Networks
Residual Networks
Vision Transformers
Ablation Studies
...and 32 more sections

Key Result

Proposition 1

Consider a depth-$L$ CNN with stride $1$, fan-in initialization (Eq. \eqref{eq:fanin-init}), and pointwise activation $\sigma$. Assume the homogeneity conditions stated in Sec. \ref{sec:method:init} (up to padding-induced boundary non-uniformity and finite-width effects). Then there exists a constant […] In particular, the leading depth exponent is $-3/2$, and the listed terms only affect the prefactor.

Figures (7)

Figure 1: Overview of prior work and our results on depth-wise learning-rate scaling. $1$ denotes one depth unit; $d$ and $D$ denote widths. (a) Prior work mainly analyzes sequential networks or special cases, and Transformers are often discussed only under width scaling. (b) We treat CNNs, ResNets, and Transformers as non-recurrent multi-path networks and obtain a unified depth law for the maximal-update learning rate.
Figure 2: Depth convention for residual networks. Depth is defined as the minimal path length. Along the minimal path, each plain layer and each residual block contributes one depth unit. If the backbone has $m$ plain layers and $K$ residual blocks, then the effective depth is $L=m+K$.
Figure 3: Depth--LR scaling on CIFAR-10. (a) CNN: grid-searched optima with 95% CIs and weighted global fit ($\hat{\alpha}=-1.339$). (b) ResNet: our AM-$\mu$P theory ($\hat{\alpha}=-1.435$) closely matches empirical data, while PathSum \cite{chen2024principled} shows increasing deviation at larger depths. (c) Transformer: the ViT variants ViT ($\hat{\alpha}=-1.44$), BEiT ($\hat{\alpha}=-1.35$), and CCT ($\hat{\alpha}=-1.45$) all exhibit clear power-law scaling consistent with our theoretical prediction of $-1.5$.
Figure 4: Loss landscapes for ViT variants on CIFAR-10. Color encodes depth (purple: shallow, yellow: deep). All three variants show systematic leftward shifts in optimal LR as depth increases, consistent with the predicted depth--LR scaling relationship.
Figure 5: Depth--LR scaling curves for ablation configurations. Each subplot shows $\log_{10}\eta^\star$ versus $\log_{10}L$ with the fitted power-law line. Configurations correspond to Table \ref{tab:ablation-all}. All configurations exhibit clear power-law relationships with exponents close to the theoretical prediction of $-1.5$.
...and 2 more figures
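
The fitted exponents $\hat{\alpha}$ quoted in the captions above are slopes of $\log_{10}\eta^\star$ against $\log_{10}L$. A minimal sketch of such a fit is shown here, using a closed-form weighted least-squares line in pure Python. The data are synthetic, generated from an exact $L^{-3/2}$ law, and the `fit_depth_exponent` helper and its optional weights are assumptions for illustration; the paper's actual fitting procedure (for example, how the 95% CIs enter the weights) is not reproduced here.

```python
import math


def fit_depth_exponent(depths, optimal_lrs, weights=None):
    """Weighted least-squares line through (log10 L, log10 eta*).

    Returns (slope, intercept); the slope estimates the depth exponent.
    """
    xs = [math.log10(d) for d in depths]
    ys = [math.log10(lr) for lr in optimal_lrs]
    ws = [1.0] * len(xs) if weights is None else list(weights)
    if not (len(xs) == len(ys) == len(ws)) or len(xs) < 2:
        raise ValueError("need at least two points and matching lengths")

    w_sum = sum(ws)
    x_bar = sum(w * x for w, x in zip(ws, xs)) / w_sum
    y_bar = sum(w * y for w, y in zip(ws, ys)) / w_sum
    sxx = sum(w * (x - x_bar) ** 2 for w, x in zip(ws, xs))
    sxy = sum(w * (x - x_bar) * (y - y_bar) for w, x, y in zip(ws, xs, ys))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return slope, intercept


# Synthetic optima that follow eta* = 0.1 * (L / 4) ** -1.5 exactly.
depths = [4, 8, 16, 32, 64]
lrs = [0.1 * (L / 4) ** -1.5 for L in depths]
slope, intercept = fit_depth_exponent(depths, lrs)

if __name__ == "__main__":
    print(f"fitted exponent: {slope:.3f}")
```

With measured optima in place of the synthetic values, a slope near $-1.5$ would be consistent with the predicted law, while the intercept absorbs the architecture-dependent prefactor.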

Theorems & Definitions (29)

Proposition: Depthwise LR scale for 1D/2D CNNs
Proposition: Depthwise LR scale for residual networks
Proposition: Depthwise LR scale for Transformers
Lemma 1: Top-layer reduction for CNN overlaps
proof
Remark: Zero padding and boundary corrections
Lemma 2: Second-moment decomposition
proof
Remark: Top-layer reduction inside the $B$-term
Lemma 3: Boundary missing-term bound for zero padding
...and 19 more
