Scaling ResNets in the Large-depth Regime

Pierre Marion, Adeline Fermanian, Gérard Biau, Jean-Philippe Vert

TL;DR

This work analyzes how scaling the residual term in deep ResNets interacts with the weight initialization to determine stability in very deep networks. By separating the large-depth behavior into regimes determined by the product $L\alpha_L^2$, the authors show that i.i.d. initializations with $\alpha_L = 1/\sqrt{L}$ lead to neural SDE limits, while $\alpha_L = 1/L$ (under correlated weights) yields neural ODE behavior; both regimes connect discrete ResNets to continuous-time dynamics. The paper provides rigorous probabilistic bounds for forward and backward signal propagation, demonstrates a continuous spectrum of regimes via fractional Brownian motion weight increments, and presents experiments on MNIST and CIFAR-10 showing how weight regularity and scaling jointly affect trainability and generalization. These insights clarify when deep ResNets behave like diffusion processes versus deterministic flows and offer practical guidance for initialization to balance stability and representational power in very deep networks.

Abstract

Deep ResNets are recognized for achieving state-of-the-art results in complex machine learning tasks. However, the remarkable performance of these architectures relies on a training procedure that needs to be carefully crafted to avoid vanishing or exploding gradients, particularly as the depth $L$ increases. No consensus has been reached on how to mitigate this issue, although a widely discussed strategy consists in scaling the output of each layer by a factor $\alpha_L$. We show in a probabilistic setting that with standard i.i.d. initializations, the only non-trivial dynamics is for $\alpha_L = \frac{1}{\sqrt{L}}$; other choices lead either to explosion or to identity mapping. This scaling factor corresponds in the continuous-time limit to a neural stochastic differential equation, contrary to a widespread interpretation that deep ResNets are discretizations of neural ordinary differential equations. By contrast, in the latter regime, stability is obtained with specific correlated initializations and $\alpha_L = \frac{1}{L}$. Our analysis suggests a strong interplay between scaling and regularity of the weights as a function of the layer index. Finally, in a series of experiments, we exhibit a continuous range of regimes driven by these two parameters, which jointly impact performance before and after training.
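
The role of the product $L\alpha_L^2$ can be seen on a deliberately simplified example. The computation below assumes a linear residual update $h_{k+1} = h_k + \alpha_L V_{k+1} h_k$ in dimension $d$, where the matrices $V_k$ are independent with i.i.d. centered entries of variance $1/d$, and where $h_0$ is fixed. This toy model is an assumption made for illustration only; it is not one of the models analyzed in the paper, and it only tracks the expected squared norm. Conditionally on $h_k$, the cross term vanishes because $V_{k+1}$ is centered, and $\mathbb{E}\big[\|V_{k+1} h_k\|^2 \mid h_k\big] = \|h_k\|^2$, so that

$$
\mathbb{E}\big[\|h_{k+1}\|^2 \mid h_k\big] = \|h_k\|^2 + 2\alpha_L\,\mathbb{E}\big[\langle h_k, V_{k+1} h_k\rangle \mid h_k\big] + \alpha_L^2\,\mathbb{E}\big[\|V_{k+1} h_k\|^2 \mid h_k\big] = (1 + \alpha_L^2)\,\|h_k\|^2 .
$$

Iterating over the $L$ layers gives

$$
\mathbb{E}\,\|h_L\|^2 = (1 + \alpha_L^2)^L\,\|h_0\|^2 = \exp\!\big(L \log(1 + \alpha_L^2)\big)\,\|h_0\|^2, \qquad L \log(1 + \alpha_L^2) \sim L\alpha_L^2 \ \text{ as } \alpha_L \to 0 .
$$

Writing $\alpha_L = L^{-\beta}$, one has $L\alpha_L^2 = L^{1-2\beta}$: the expected squared norm diverges with $L$ when $\beta < 1/2$, converges to $e\,\|h_0\|^2$ when $\beta = 1/2$, and converges to $\|h_0\|^2$ when $\beta > 1/2$. In this toy setting, this matches the explosion, non-trivial, and identity regimes described in the abstract.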

Paper Structure (31 sections, 20 theorems, 131 equations, 15 figures, 4 tables)

Key Result

Proposition 1

Let res-1, res-2, and res-3 be the models defined in Table tab:examples. Then

Figures (15)

  • Figure 1: Evolution of $\|h_L - h_0\|/\|h_0\|$ as a function of $L$ for different values of $\beta$ and an i.i.d. $\mathcal{U}(-\sqrt{3/d}, \sqrt{3/d})$ initialization of model res-3, with $d=40$. The input is a random Gaussian observation $x$ in dimension $n_{\textnormal{in}} = 64$. The experiment is repeated with $50$ independent randomizations.
  • Figure 2: Empirical distributions of the norms for $\beta = 1/2$, $L=10^3$, $d=100$. The experiment is repeated with $10^4$ independent randomizations.
  • Figure 3: Evolution of $\|p_0-p_L\| / \|p_L\|$ as a function of $L$ for different values of $\beta$ and an i.i.d. $\mathcal{U}(-\sqrt{3/d}, \sqrt{3/d})$ initialization of model res-3, with $d=40$. The input is a random Gaussian observation $x$ in dimension $n_{\textnormal{in}} = 64$. The experiment is repeated with $50$ independent randomizations.
  • Figure 4: Evolution of $\|h_L-h_0\|/\|h_0\|$ as a function of $L$ for different values of $\beta$ and a smooth initialization of model res-3, with $d=40$. The input is a random Gaussian observation $x$ in dimension $n_{\textnormal{in}} = 64$. The experiment is repeated with $50$ independent randomizations.
  • Figure 5: Evolution of $\|p_0-p_L\|/\|p_L\|$ as a function of $L$ for different values of $\beta$ and a smooth initialization of model res-3, with $d=40$. The input is a random Gaussian observation $x$ in dimension $n_{\textnormal{in}} = 64$. The experiment is repeated with $50$ independent randomizations.
  • ...and 10 more figures
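
The quantity reported in Figure 1, $\|h_L - h_0\|/\|h_0\|$ as a function of $L$ and $\beta$, can be explored with a short simulation. The sketch below is not the paper's code: it assumes $\alpha_L = L^{-\beta}$ and a linear residual update $h_{k+1} = h_k + \alpha_L V_{k+1} h_k$ as a stand-in for model res-3, whose definition is not given in this summary. It also skips the input embedding from dimension $n_{\textnormal{in}}$ and draws $h_0$ directly as a Gaussian vector. The function names `relative_change` and `mean_relative_change` are hypothetical. Only the $\mathcal{U}(-\sqrt{3/d}, \sqrt{3/d})$ initialization and $d = 40$ are taken from the caption.

```python
import numpy as np


def relative_change(L, beta, d=40, seed=0):
    """Return ||h_L - h_0|| / ||h_0|| for one random draw of the weights.

    Assumption: linear residual update h <- h + L**(-beta) * V @ h,
    a simplified stand-in for the paper's models.
    """
    rng = np.random.default_rng(seed)
    bound = np.sqrt(3.0 / d)          # U(-bound, bound) has variance 1/d
    alpha = float(L) ** (-beta)       # scaling factor alpha_L = L^(-beta)
    h0 = rng.standard_normal(d)       # hidden state at layer 0
    h = h0.copy()
    for _ in range(L):
        # Fresh i.i.d. weights at every layer
        V = rng.uniform(-bound, bound, size=(d, d))
        h = h + alpha * (V @ h)
    return np.linalg.norm(h - h0) / np.linalg.norm(h0)


def mean_relative_change(L, beta, d=40, n_rep=20):
    """Average the relative change over n_rep independent seeds."""
    values = [relative_change(L, beta, d=d, seed=s) for s in range(n_rep)]
    return float(np.mean(values))


if __name__ == "__main__":
    depths = (16, 64, 256)
    for beta in (0.25, 0.5, 1.0):
        row = [mean_relative_change(L, beta) for L in depths]
        print(f"beta={beta}: " + ", ".join(f"{v:.3g}" for v in row))
```

Under these assumptions, the printed values should grow quickly with $L$ for $\beta = 0.25$, stay of order one for $\beta = 0.5$, and shrink towards zero for $\beta = 1$, which is the qualitative pattern the caption describes for an i.i.d. initialization.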

Theorems & Definitions (23)

  • Proposition 1
  • Remark 2
  • Proposition 3
  • Proposition 4
  • Corollary 5
  • Proposition 6
  • Proposition 7
  • Proposition 8
  • Corollary 9
  • Definition 10
  • ...and 13 more