On residual network depth
Benoit Dherin, Michael Munn
TL;DR
The paper derives the Residual Expansion Theorem, formalizing the view that deep residual networks function as a hierarchical ensemble whose size grows with depth. It identifies a combinatorial explosion of functional paths as the root of instability in unnormalized models and shows that principled scaling of residual branches by $\lambda$ (notably $1/n$ or $1/\sqrt{n}$) can tame this growth, enabling normalization-free training while also affecting model capacity and geometry. The work connects to and clarifies normalization layers such as BatchNorm and normalization-free methods such as Fixup, offering a first-principles explanation for their empirical success and guiding principled design choices for very deep architectures. It also points to future directions, including applying the function-first scaling perspective to Transformers and exploring principled scaling laws beyond finite depth.
Abstract
Deep residual architectures, such as ResNet and the Transformer, have enabled models of unprecedented depth, yet a formal understanding of why depth is so effective remains an open question. A popular intuition, following Veit et al. (2016), is that these residual networks behave like ensembles of many shallower models. Our key finding is an explicit analytical formula that verifies this ensemble perspective, proving that increasing network depth is mathematically equivalent to expanding the size of this implicit ensemble. Furthermore, our expansion reveals a hierarchical ensemble structure in which the combinatorial growth of computation paths leads to an explosion in the output signal. This insight offers a first-principles explanation for the historical dependence on normalization layers and sheds new light on a family of successful normalization-free techniques like SkipInit and Fixup. However, while these previous approaches infer scaling factors through optimizer analysis or a heuristic analogy to Batch Normalization, our work offers the first explanation derived directly from the network's inherent functional structure. Specifically, our Residual Expansion Theorem reveals that scaling each residual module provides a principled solution to taming the combinatorial explosion inherent to these architectures. We further show that this scaling acts as a capacity control that also implicitly regularizes the model's complexity.
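The ensemble view and the effect of scaling can be illustrated with a toy model. The sketch below is not the paper's theorem or code: it assumes scalar, linear residual blocks of the form x -> x + lam * w * x, and the names `forward`, `path_expansion`, and `worst_case_gain` are hypothetical. Under that assumption, the output of n blocks expands exactly into a sum over all 2^n subsets of branches, and a worst-case gain of (1 + lam)^n shows why lam = 1 explodes with depth while lam = 1/n stays bounded.

```python
from itertools import combinations
from math import comb, e, isclose, prod


def forward(x, weights, lam):
    """Apply scalar residual blocks x -> x + lam * w * x one after another."""
    for w in weights:
        x = x + lam * w * x
    return x


def path_expansion(x, weights, lam):
    """Return the path contributions grouped by how many branches a path uses.

    Entry k sums over every subset of k blocks (one path per subset),
    so the list as a whole covers all 2**n paths of the toy network.
    """
    n = len(weights)
    by_order = []
    for k in range(n + 1):
        term = sum(prod(weights[i] for i in s) for s in combinations(range(n), k))
        by_order.append(lam ** k * term * x)
    return by_order


def worst_case_gain(n, lam):
    """Output gain when every branch has unit weight: (1 + lam) ** n."""
    return (1.0 + lam) ** n


# Illustrative weights (assumed values, not taken from the paper).
weights = [0.5, -1.2, 0.8, 1.5, -0.3]
n = len(weights)

# The sequential forward pass equals the sum over all paths.
terms = path_expansion(1.0, weights, lam=1.0)
assert isclose(sum(terms), forward(1.0, weights, lam=1.0))

# Number of paths at each order; together they make up the 2**n ensemble members.
path_counts = [comb(n, k) for k in range(n + 1)]

# Unscaled branches: gain doubles with every added block.
# Branches scaled by 1/n: gain stays below e at any depth.
unscaled = worst_case_gain(50, 1.0)
scaled = worst_case_gain(50, 1.0 / 50)
```

In this toy setting, adding one block doubles the number of paths, which is the sense in which depth expands the ensemble. The 1/sqrt(n) scaling mentioned in the TL;DR is not covered by this worst-case sketch, since its justification depends on assumptions about the branch weights that are not stated here.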
