The Hidden Width of Deep ResNets: Tight Error Bounds and Phase Diagram

Lénaïc Chizat

TL;DR

This work proves a high-probability error bound between the ResNet and its limit after a fixed number of gradient steps, and identifies the residual scale $O\big(\frac{\sqrt{D}}{LM}\big)$ as necessary and sufficient for maximal local feature updates.

Abstract

We study the gradient-based training of large-depth residual networks (ResNets) from standard random initializations. We show that infinite-depth ResNets behave as if they were infinitely wide, regardless of their actual width. More precisely, we obtain that with a fixed embedding dimension $D$, the training dynamics converge to a unique Neural Mean ODE training dynamics as the depth $L$ diverges, regardless of the scaling of the hidden width $M$. For a residual scale $\Theta_D\big(\frac{\alpha}{LM}\big)$ with $\alpha=\Theta_D(1)$, we obtain the error bound $O_D\big(\frac{1}{L}+ \frac{1}{\sqrt{LM}}\big)$ between the model's output and its limit after a fixed number of gradient steps. In this regime, the limit exhibits maximal local feature updates, i.e. the Mean ODE is genuinely non-linearly parameterized. In contrast, we show that $\alpha\to \infty$ yields a lazy ODE regime where the Mean ODE is linearly parameterized, and we derive a convergence rate in this case as well. We then focus on the particular case of ResNets with two-layer perceptron blocks, for which we study how these scalings depend on the embedding dimension $D$. We identify the residual scale $O\big(\frac{\sqrt{D}}{LM}\big)$ as necessary and sufficient for maximal local feature updates. In this regime, we prove a high-probability error bound $O\big(\frac{1}{L}+ \frac{\sqrt{D}}{\sqrt{LM}}\big)$ between the ResNet and its limit after a fixed number of gradient steps. Our convergence results rely on a novel mathematical perspective on ResNets: (i) due to the randomness of the initialization, the forward and backward passes through the ResNet behave as the stochastic approximation of certain mean ODEs, and (ii) by propagation of chaos (that is, asymptotic independence of the units) this behavior is preserved through the training dynamics. We verify empirically that all our rates are tight.
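
To make the residual scale concrete, here is a minimal NumPy sketch of a forward pass through a ResNet with two-layer-perceptron blocks, where each block's update is multiplied by $\alpha\sqrt{D}/(LM)$. Only this scale comes from the abstract. The function name `resnet_forward`, the `tanh` activation, the weight shapes and the standard Gaussian initialization are illustrative assumptions, not the paper's exact parameterization.

```python
import numpy as np

def resnet_forward(x, U, V, alpha=1.0):
    """Forward pass of a toy ResNet with two-layer-perceptron blocks.

    x: (D,) input embedding.
    U: (L, M, D) first-layer weights of each block (assumed layout).
    V: (L, M, D) second-layer weights of each block (assumed layout).
    alpha: multiplier on the residual scale; alpha = 1 gives sqrt(D)/(L*M).
    """
    L, M, D = U.shape
    scale = alpha * np.sqrt(D) / (L * M)  # residual scale from the abstract
    h = np.asarray(x, dtype=float)
    for ell in range(L):
        hidden = np.tanh(U[ell] @ h)      # (M,) hidden units; tanh is an assumption
        h = h + scale * (V[ell].T @ hidden)  # residual update, shape (D,)
    return h

# Hypothetical sizes and Gaussian initialization, for illustration only.
rng = np.random.default_rng(0)
L, M, D = 64, 4, 8
U = rng.normal(size=(L, M, D))
V = rng.normal(size=(L, M, D))
x = rng.normal(size=D)
out = resnet_forward(x, U, V)
```

Each block sums $M$ unit contributions and the network stacks $L$ blocks, so the factor $1/(LM)$ averages over all $LM$ units. This is the sense in which depth can stand in for width in this sketch.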

Paper Structure

This paper contains 45 sections, 15 theorems, 127 equations, 5 figures.

Key Result

Theorem 1

Let Assumption ass:regularity hold with $B>0$, let $\alpha=1$ (MLU regime), and let $\mu_0\in \mathcal{P}(\mathbb{R}^p)$ be a subgaussian distribution with variance proxy $\sigma_0^2\leq B$. Then $\forall k\geq 0$, there exist $c_1,c_2>0$ that depend only on $B$, $D$ and $k\eta$ such that with probability [...] provided that the right-hand side is smaller than $c_2$.

Figures (5)

  • Figure 1: Forward pass (1D projection, fixed input) of trained ResNets (after $K=100$ GD iterations) with two-layer-perceptron blocks, varying depths $L$ and hidden width $M=1$. The red curve shows the corresponding forward pass for the limit model, approximated with a ResNet of very large hidden width and depth (setting detailed in Section \ref{sec:experiments-LM}). The convergence rate towards the red curve is shown in Figure \ref{fig:experiment1} and characterized in Theorem \ref{claim:D-dependence}.
  • Figure 2: Comparison of the experimental errors (bullets) with the theoretical upper-bound $a/L+b/\sqrt{ML}$ from Theorem \ref{thm:main} with $a=0.15$ and $b=0.22$ manually adjusted to fit observations (plain lines). The y-axis shows root mean square error (averaged over $10$ random repetitions) on the output after $k=100$ GD steps (same setting as Figure \ref{fig:cover}, details in Section \ref{sec:experiments-LM}).
  • Figure 3: (left) The square loss of the Mean ODE model is close to $0$ at $k=100$, indicating approximate convergence. (right) Various 2D projections of the curve in $\mathbb{R}^D$ representing the evolution of the weight $(\hat{U}^{1,\ell}_k)_{k\in [1:100]}$ where $\ell$ ranges from $1$ (blue) to $L$ (purple). For the purpose of illustration and for this plot only, we have initialized $(\hat{U}^{1,\ell}_0,\hat{V}^{1,\ell}_0)=(U_0,V_0)$ for all $\ell$ (while the rest of the weights for $j\geq 2$ are independently initialized across layers). This illustrates two important properties: (i) the evolution of $\hat{U}$ is significant (MLU regime) and (ii) the map $(\ell,k) \mapsto U_k(s_{\ell-1}) \approx \hat{U}^{1,\ell}_k$ is regular in $\ell$ and $k$.
  • Figure 4: Phase diagram for the ResNets \ref{eq:parameterization-uv} as a function of the initialization scale $\sigma_u$ (for all shapes such that $D=O(LM)$). In the sub-critical MLU regime (green area), the behavior is asymptotically the same as if $\sigma_v=0$ (see Remark \ref{rmk:subcritical}). Therefore, the critical MLU regime (blue) stands out as the only scaling with MLU and also maximal feature diversity.
  • Figure 5: RMS error on the output between the ResNet and the limit (Mean ODE) model after $k$ GD steps. The red dots correspond to a large ratio $D/(ML)$ and are outside of the regime covered by our theory, while the blue dots are within the theory. We train 2100 ResNets of various shapes with $M\in \{1,\dots,200\}$, $L\in \{1,\dots, 200\}$ and $D\in \{1,\dots,100\}$ and compare the error with $a_k/L + b_k\sqrt{D/(LM)}$ where for each $k$, the coefficients $a_k$ and $b_k$ are estimated via least-squares fit. These results demonstrate the tightness of our analysis.
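
The least-squares fit mentioned in the Figure 5 caption can be sketched as follows. This is a minimal illustration, not the authors' script: the function names are hypothetical, and the data is synthetic, generated from placeholder coefficients rather than from the paper's measurements.

```python
import numpy as np

def predicted_error(L, M, D, a, b):
    """Rate model from the Figure 5 caption: a / L + b * sqrt(D / (L * M))."""
    L, M, D = (np.asarray(v, dtype=float) for v in (L, M, D))
    return a / L + b * np.sqrt(D / (L * M))

def fit_rate_coefficients(L, M, D, err):
    """Least-squares estimate of (a, b) in err ~ a / L + b * sqrt(D / (L * M))."""
    L, M, D, err = (np.asarray(v, dtype=float) for v in (L, M, D, err))
    # One column per term of the rate model; the model is linear in (a, b).
    features = np.column_stack([1.0 / L, np.sqrt(D / (L * M))])
    coef, *_ = np.linalg.lstsq(features, err, rcond=None)
    return coef[0], coef[1]

# Synthetic shapes and placeholder coefficients, for illustration only.
rng = np.random.default_rng(0)
L = rng.integers(1, 201, size=500)
M = rng.integers(1, 201, size=500)
D = rng.integers(1, 101, size=500)
a_true, b_true = 0.5, 0.3  # placeholders, not values reported in the paper
err = predicted_error(L, M, D, a_true, b_true)
a_hat, b_hat = fit_rate_coefficients(L, M, D, err)
```

Because the model is linear in the two coefficients, an ordinary linear least-squares solve is enough and no iterative optimizer is needed. With real measurements, one fit would be run per value of $k$, matching the caption's $a_k$ and $b_k$.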

Theorems & Definitions (28)

  • Theorem 1: Convergence in the MLU regime
  • Remark 2.1: Analogy between ResNets and SGD
  • Theorem 2: Convergence in the lazy ODE regime
  • Remark 4.1
  • Theorem 3: Error bound for ResNets with 2LP blocks with dimensional dependency
  • Remark 4.2: Subcritical MLU regime
  • Lemma 5.1
  • proof
  • Lemma 5.2: Propagation of regularity
  • proof
  • ...and 18 more