The Hidden Width of Deep ResNets: Tight Error Bounds and Phase Diagram
Lénaïc Chizat
TL;DR
This work proves a high-probability error bound between the ResNet and its limit after a fixed number of gradient steps, and identifies the residual scale $O\big(\frac{\sqrt{D}}{LM}\big)$ as necessary and sufficient for maximal local feature updates.
Abstract
We study the gradient-based training of large-depth residual networks (ResNets) from standard random initializations. We show that infinite-depth ResNets behave as if they were infinitely wide, regardless of their actual width. More precisely, we obtain that with a fixed embedding dimension $D$, the training dynamics converges to a unique Neural Mean ODE training dynamics as the depth $L$ diverges, regardless of the scaling of the hidden width $M$. For a residual scale $Θ_D\big(\fracα{LM}\big)$ with $α=Θ_D(1)$, we obtain the error bound $O_D\big(\frac{1}{L}+ \frac{1}{\sqrt{LM}}\big)$ between the model's output and its limit after a fixed number gradient of steps. In this regime, the limit exhibits maximal local feature updates, i.e. the Mean ODE is genuinely non-linearly parameterized. In contrast, we show that $α\to \infty$ yields a lazy ODE regime where the Mean ODE is linearly parameterized, and we derive a convergence rate in this case as well. We then focus on the particular case of ResNets with two-layer perceptron blocks, for which we study how these scalings depend on the embedding dimension $D$. We identify the residual scale $O\big(\frac{\sqrt{D}}{LM}\big)$ as necessary and sufficient for maximal local feature updates. In this regime, we prove a high-probability error bound $O\big(\frac{1}{L}+ \frac{\sqrt{D}}{\sqrt{LM}}\big)$ between the ResNet and its limit after a fixed number of gradient steps. Our convergence results rely on a novel mathematical perspective on ResNets : (i) due to the randomness of the initialization, the forward and backward pass through the ResNet behave as the stochastic approximation of certain mean ODEs, and (ii) by propagation of chaos (that is, asymptotic independence of the units) this behavior is preserved through the training dynamics. We verify empirically that all our rates are tight.
