On residual network depth

Benoit Dherin, Michael Munn

TL;DR

The paper derives the Residual Expansion Theorem, formalizing the view that deep residual networks function as a hierarchical ensemble whose size grows with depth. It identifies a combinatorial explosion of functional paths as the root of instability in unnormalized models and shows that principled scaling of residual branches by $\lambda$ (notably $1/n$ or $1/\sqrt{n}$) can tame this growth, enabling normalization-free training while affecting capacity and geometry. The work connects to and clarifies normalization-based methods like BatchNorm and normalization-free methods like Fixup, offering a first-principles explanation for their empirical success and guiding principled design choices for very deep architectures. It also points to future directions, including applying the function-first scaling perspective to Transformers and exploring principled scaling laws beyond finite depth.

Abstract

Deep residual architectures, such as ResNet and the Transformer, have enabled models of unprecedented depth, yet a formal understanding of why depth is so effective remains an open question. A popular intuition, following Veit et al. (2016), is that these residual networks behave like ensembles of many shallower models. Our key finding is an explicit analytical formula that verifies this ensemble perspective, proving that increasing network depth is mathematically equivalent to expanding the size of this implicit ensemble. Furthermore, our expansion reveals a hierarchical ensemble structure in which the combinatorial growth of computation paths leads to an explosion in the output signal, explaining the historical necessity of normalization layers in training deep models. This insight offers a first-principles explanation for the historical dependence on normalization layers and sheds new light on a family of successful normalization-free techniques like SkipInit and Fixup. However, while these previous approaches infer scaling factors through optimizer analysis or a heuristic analogy to Batch Normalization, our work offers the first explanation derived directly from the network's inherent functional structure. Specifically, our Residual Expansion Theorem reveals that scaling each residual module provides a principled solution to taming the combinatorial explosion inherent to these architectures. We further show that this scaling acts as a capacity control that also implicitly regularizes the model's complexity.
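The combinatorial growth of paths can be illustrated with a simple count. The sketch below is not taken from the paper: it assumes, purely for illustration, that every residual branch contributes a factor of at most $\lambda$ to the weight of any path passing through it (for example, branches with operator norm at most 1). Under that assumption the $\binom{n}{k}$ paths crossing exactly $k$ branches have total weight at most $\binom{n}{k}\lambda^k$, and the sum over all $2^n$ paths is $(1+\lambda)^n$. The function names are hypothetical.

```python
from math import comb, e, sqrt


def path_weight_by_order(n, lam):
    """Worst-case total weight of all paths crossing exactly k of n branches.

    Assumption (not from the paper): each branch scales a path by at most lam,
    so the C(n, k) paths of order k weigh at most C(n, k) * lam**k in total.
    """
    return [comb(n, k) * lam**k for k in range(n + 1)]


def total_path_weight(n, lam):
    """Sum over all 2**n paths; equals (1 + lam)**n by the binomial theorem."""
    return sum(path_weight_by_order(n, lam))


if __name__ == "__main__":
    print("n, lam=1, lam=1/sqrt(n), lam=1/n")
    for n in (4, 16, 64):
        unscaled = total_path_weight(n, 1.0)            # grows like 2**n
        sqrt_scaled = total_path_weight(n, 1 / sqrt(n))  # grows, but far more slowly
        inv_scaled = total_path_weight(n, 1 / n)         # stays below e for every n
        print(n, unscaled, round(sqrt_scaled, 3), round(inv_scaled, 3))
```

In this worst-case count, $\lambda = 1$ gives a total of $2^n$, while $\lambda = 1/n$ keeps the total below $e$ at every depth. The count only bounds magnitudes; it says nothing about cancellations between paths, which this sketch does not model.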

Paper Structure

This paper contains 20 sections, 2 theorems, 24 equations, 2 figures, 1 table.

Key Result

Theorem 3.1

Consider a residual network with $n$ blocks of the form given in the residual network equation. First, the residual tower admits the following expansion: [equation not reproduced]. As a result, the residual network can be expressed as an infinite sum of increasingly larger ensembles of models: [equation not reproduced]. Moreover, if we further assume that the encoding network is an affine map $E_\xi(x) = W_\xi x + b_\xi$, then th
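The displayed equations of the theorem are not available in this summary, so the following is only a minimal sketch of the underlying idea in a simplified setting. It assumes linear residual blocks $x \mapsto x + \lambda R_i x$, which is an assumption made here for illustration and not the paper's general setting. In that case the tower $(I + \lambda R_n)\cdots(I + \lambda R_1)$ expands exactly into a sum over ordered subsets of blocks, with the order-$k$ terms weighted by $\lambda^k$. The helper names and example matrices are hypothetical; exact rational arithmetic is used so the two sides can be compared with `==`.

```python
from fractions import Fraction
from itertools import combinations


def identity(d):
    return [[Fraction(int(i == j)) for j in range(d)] for i in range(d)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def scale(c, a):
    return [[c * x for x in row] for row in a]


def tower_product(blocks, lam):
    """Compute (I + lam*R_n) ... (I + lam*R_1); block 1 is applied first."""
    d = len(blocks[0])
    out = identity(d)
    for r in blocks:
        out = matmul(add(identity(d), scale(lam, r)), out)
    return out


def path_expansion(blocks, lam):
    """Sum over k of lam**k * sum of R_{i_k} ... R_{i_1} with i_1 < ... < i_k."""
    d = len(blocks[0])
    total = identity(d)  # k = 0: the single path that skips every block
    for k in range(1, len(blocks) + 1):
        for idx in combinations(range(len(blocks)), k):
            term = identity(d)
            for i in idx:  # increasing index: earlier blocks act first
                term = matmul(blocks[i], term)
            total = add(total, scale(lam**k, term))
    return total


# Hypothetical non-commuting 2x2 blocks, chosen only to exercise the identity.
BLOCKS = [
    [[Fraction(1), Fraction(2)], [Fraction(0), Fraction(1)]],
    [[Fraction(0), Fraction(1)], [Fraction(-1), Fraction(3)]],
    [[Fraction(2), Fraction(0)], [Fraction(1), Fraction(1)]],
]

if __name__ == "__main__":
    lam = Fraction(1, len(BLOCKS))
    print(tower_product(BLOCKS, lam) == path_expansion(BLOCKS, lam))
```

The order-$k$ level of the sum contains $\binom{n}{k}$ terms, so in this linear setting each level behaves like an ensemble of depth-$k$ sub-networks whose contribution is controlled by $\lambda^k$.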

Figures (2)

  • Figure 1: A residual network with $n=16$ residual blocks trained on CIFAR-10. We plot the learning curves for the experiments in Figure 2 for a sweep $\lambda\in\{0, n^{-2}, n^{-1.5}, n^{-1.2}, n^{-1}, n^{-0.8}, n^{-0.5}\}$. Training runs with $\lambda > 1/\sqrt n$ (e.g., we also tried $\lambda \in \{ 1/n^{0.3}, 1/n^{0.4}, 1\}$) all failed.
  • Figure 2: Maximum test accuracy and geometric complexity at time of maximum test accuracy for various values of $\lambda$. Left: As $\lambda$ increases, maximum test accuracy increases. Right: However, increasing $\lambda$ leads to decreased model complexity after a first phase of increase.

Theorems & Definitions (5)

  • Theorem 3.1: The Residual Expansion Theorem
  • proof
  • Remark 4.1
  • Corollary B.0.1
  • proof