Depthwise Hyperparameter Transfer in Residual Networks: Dynamics and Scaling Limit

Blake Bordelon, Lorenzo Noci, Mufan Bill Li, Boris Hanin, Cengiz Pehlevan

Abstract

The cost of hyperparameter tuning in deep learning has been rising with model sizes, prompting practitioners to find new tuning methods using a proxy of smaller networks. One such proposal uses $\mu$P parameterized networks, where the optimal hyperparameters for small width networks transfer to networks with arbitrarily large width. However, in this scheme, hyperparameters do not transfer across depths. As a remedy, we study residual networks with a residual branch scale of $1/\sqrt{\text{depth}}$ in combination with the $\mu$P parameterization. We provide experiments demonstrating that residual architectures including convolutional ResNets and Vision Transformers trained with this parameterization exhibit transfer of optimal hyperparameters across width and depth on CIFAR-10 and ImageNet. Furthermore, our empirical findings are supported and motivated by theory. Using recent developments in the dynamical mean field theory (DMFT) description of neural network learning dynamics, we show that this parameterization of ResNets admits a well-defined feature learning joint infinite-width and infinite-depth limit and show convergence of finite-size network dynamics towards this limit.
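To make the role of the $1/\sqrt{\text{depth}}$ branch scale concrete, the following is a minimal NumPy sketch of a forward pass at random initialization. It is not the authors' code. The tanh nonlinearity, the standard normal weights with a $1/\sqrt{\text{width}}$ factor, and the function names `residual_forward` and `final_mean_square` are all assumptions made for illustration. The sketch only compares how the mean squared preactivation behaves with and without the depth-dependent scale, and it does not cover training or the $\mu$P learning rate rules.

```python
import numpy as np


def residual_forward(x, depth, width, branch_scale, rng):
    """Run a toy residual stack at random initialization.

    Assumed update (illustrative, not taken from the paper's code):
        h <- h + branch_scale * W @ tanh(h) / sqrt(width)
    with a fresh standard normal matrix W at every layer.
    """
    h = x.copy()
    for _ in range(depth):
        W = rng.standard_normal((width, width))
        h = h + branch_scale * (W @ np.tanh(h)) / np.sqrt(width)
    return h


def final_mean_square(depth, width=256, scaled=True, seed=0):
    """Mean squared preactivation after `depth` residual blocks."""
    rng = np.random.default_rng(seed)
    x = rng.standard_normal(width)
    # scaled=True uses the 1/sqrt(depth) branch scale; scaled=False uses 1.
    branch_scale = 1.0 / np.sqrt(depth) if scaled else 1.0
    h = residual_forward(x, depth, width, branch_scale, rng)
    return float(np.mean(h ** 2))


if __name__ == "__main__":
    for depth in (16, 64, 256):
        with_scale = final_mean_square(depth, scaled=True)
        without_scale = final_mean_square(depth, scaled=False)
        print(f"depth={depth:4d}  scaled={with_scale:8.3f}  unscaled={without_scale:8.3f}")
```

In this toy setting each block adds a variance contribution proportional to the squared branch scale, so with the $1/\sqrt{\text{depth}}$ factor the total contribution of all blocks stays of order one as depth grows, whereas without it the preactivation scale grows with depth.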

Paper Structure (58 sections, 3 theorems, 135 equations, 16 figures, 1 table)

Key Result

Proposition 1

Consider network training dynamics of a ResNet as in Equation eq:res-model with our $\mu$P $\frac{1}{\sqrt{L}}$ scaling and the appropriate choice of learning rate and $\gamma_0$ in the infinite width and depth limit. The preactivation $h(\tau;\bm x,t)$ drawn from the marginal density of neurons in where $du(\tau;\bm x;t)$ is zero mean Brownian motion. The covariance of $du$ is the feature kernel

Figures (16)

  • Figure 1: The optimal learning rate $\eta^*$ transfers across both depth and width in our proposed parameterization but not in $\mu$P or standard parameterization (Fig. \ref{fig:lr-transfer-sp}). Loss is plotted after $20$ epochs on CIFAR-10. All the missing datapoints indicate that the corresponding run diverged.
  • Figure 2: Effect of batch normalization on a standard ResNet18-type architecture [he2016deep]. The above examples are taken after $20$ epochs on CIFAR-10. Normalization layers can slightly improve consistency and trainability across depths in standard ResNets, but consistency is much more reliable with the $1/\sqrt L$ scaling (precisely, we use $\beta_\ell = 3/\sqrt{L}$ here to increase feature learning at finite depth). Runs that exceed a target loss of 0.5 are removed from the plot for visual clarity.
  • Figure 3: ViTs trained with Adam also exhibit learning rate transfer, with or without LayerNorm and $1/\sqrt{L}$-scaling. The above examples are taken after 20 epochs on Tiny ImageNet for (a)-(d) and ImageNet after $10$ epochs for (e)-(f). In (a)-(c), the learning rate is linearly increased to the target in the first $1000$ steps. In (d), after $2000$ warm-up steps, the learning rate is decreased to zero with a cosine schedule in $100$ epochs. Notice that in addition to the transfer across all the settings, the $1/\sqrt{L}$ models show a stronger non-saturating benefit of deeper models. (e)-(f) ViTs on first few epochs of ImageNet. Also, notice that (a), (d), (e) and (f) refer to models without normalization layers.
  • Figure 4: Other hyperparameters also transfer. This example shows training dynamics during the first $20$ epochs on CIFAR-10 (architecture details in Appendix \ref{app:exp-details}). The dynamics at two different depths are provided for (a) momentum and (b) feature learning rate $\gamma_0$.
  • Figure 5: Approximation of the joint $N \to \infty$, $L \to \infty$ limit requires both sufficiently large $N$ and $L$. CNNs are compared after $250$ steps on CIFAR-10 with batch size $32$ and $\gamma_0 = 1.0$. Since we cannot compute the infinite width and depth predictor, we use a proxy $f_{\approx \infty,\infty}$: the ensemble averaged (over random inits) predictor of networks with $N=512$ and $L = 128$. Error bars are the standard deviation computed over $15$ random initializations. (a) SGD dynamics at large width are strikingly consistent across depths. (b) The convergence of $f_{N,L}$ to a large width and depth proxy $f_{\approx \infty, \infty}$ is bottlenecked by width $N$ for small $N$, while for large $N$ the error decays like $\mathcal{O}(L^{-1})$.
  • ...and 11 more figures

Theorems & Definitions (3)

  • Proposition 1: Informal
  • Proposition 2: Informal
  • Proposition 3