Residual Alignment: Uncovering the Mechanisms of Residual Networks
Jianing Li, Vardan Papyan
TL;DR
This work investigates why ResNets perform so well by linearizing residual blocks via Residual Jacobians and measuring their singular value decompositions. The measurements reveal Residual Alignment (RA), a four-part phenomenon: (RA1) intermediate representations of an input are equispaced along a line; (RA2) top left and right singular vectors of Residual Jacobians align with each other and across depths; (RA3) Residual Jacobians have rank at most C, the number of classes, in fully-connected ResNets; and (RA4) top singular values scale inversely with depth. The authors prove a link between RA2–RA4 and RA1 in binary classification, and introduce the Unconstrained Jacobians Model, in which RA provably occurs. Empirically, RA is observed across diverse ResNet variants, depths, and datasets; it co-occurs with Neural Collapse and disappears when skip connections are removed. Counterfactual experiments show how class count and stochastic depth modulate RA. The discussion outlines implications for generalization, possible extensions to Transformers and recurrent architectures, and prospects for model compression and new regularization strategies.
Abstract
The ResNet architecture has been widely adopted in deep learning due to its significant boost to performance through the use of simple skip connections, yet the underlying mechanisms leading to its success remain largely unknown. In this paper, we conduct a thorough empirical study of the ResNet architecture in classification tasks by linearizing its constituent residual blocks using Residual Jacobians and measuring their singular value decompositions. Our measurements reveal a process called Residual Alignment (RA) characterized by four properties: (RA1) intermediate representations of a given input are equispaced on a line embedded in high-dimensional space, as observed by Gai and Zhang [2021]; (RA2) top left and right singular vectors of Residual Jacobians align with each other and across different depths; (RA3) Residual Jacobians are at most rank C for fully-connected ResNets, where C is the number of classes; and (RA4) top singular values of Residual Jacobians scale inversely with depth. RA consistently occurs in models that generalize well, in both fully-connected and convolutional architectures, across various depths and widths, for varying numbers of classes, on all tested benchmark datasets, but ceases to occur once the skip connections are removed. It also provably occurs in a novel mathematical model we propose. This phenomenon reveals a strong alignment between residual branches of a ResNet (RA2+4), imparting a highly rigid geometric structure to the intermediate representations as they progress linearly through the network (RA1) up to the final layer, where they undergo Neural Collapse.
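To make the measurement described above concrete, the following is a minimal sketch of how Residual Jacobians and their singular value decompositions could be computed for a toy fully-connected ResNet. It assumes blocks of the form h_{i+1} = h_i + F_i(h_i) with a hypothetical branch F_i(h) = W2 tanh(W1 h), and takes the Residual Jacobian to be the Jacobian of F_i with respect to h_i. The function names, the tanh branch, the random untrained weights, and the two summary statistics are illustrative assumptions, not the authors' code or their exact alignment metric. Because the weights are random, no alignment should be expected in the printed numbers; the sketch only shows what is being measured.

```python
import numpy as np

def residual_branch(h, W1, W2):
    """Hypothetical residual branch F(h) = W2 @ tanh(W1 @ h)."""
    return W2 @ np.tanh(W1 @ h)

def residual_jacobian(h, W1, W2):
    """Analytic Jacobian dF/dh of the branch above, evaluated at h."""
    pre = W1 @ h
    return W2 @ (np.diag(1.0 - np.tanh(pre) ** 2) @ W1)

def forward_with_jacobians(x, blocks):
    """Apply h_{i+1} = h_i + F_i(h_i); collect representations and Jacobians."""
    hs, jacobians = [x], []
    h = x
    for W1, W2 in blocks:
        jacobians.append(residual_jacobian(h, W1, W2))
        h = h + residual_branch(h, W1, W2)
        hs.append(h)
    return hs, jacobians

def top_singular_triplet(J):
    """Top left singular vector, top singular value, top right singular vector."""
    U, S, Vt = np.linalg.svd(J)
    return U[:, 0], S[0], Vt[0]

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Toy setup (assumed sizes; random weights, so this network is untrained).
rng = np.random.default_rng(0)
dim, hidden, depth = 8, 16, 4
blocks = [
    (rng.normal(size=(hidden, dim)) / np.sqrt(dim),
     rng.normal(size=(dim, hidden)) / (np.sqrt(hidden) * depth))
    for _ in range(depth)
]
x = rng.normal(size=dim)
hs, jacobians = forward_with_jacobians(x, blocks)
triplets = [top_singular_triplet(J) for J in jacobians]

# RA1 proxy: consecutive steps h_{i+1} - h_i should be parallel and equal in
# length if the representations are equispaced on a line.
steps = [hs[i + 1] - hs[i] for i in range(depth)]
for i in range(depth - 1):
    print("step cosine", i, cosine(steps[i], steps[i + 1]),
          "length ratio", np.linalg.norm(steps[i + 1]) / np.linalg.norm(steps[i]))

# RA2 proxy: |u_i . v_i| within a block and |u_i . u_{i+1}| across depth.
# RA4 proxy: the top singular value s_i of each Residual Jacobian.
for i in range(depth - 1):
    u_i, s_i, v_i = triplets[i]
    u_next = triplets[i + 1][0]
    print("block", i, "top singular value", s_i,
          "left-right", abs(u_i @ v_i), "across depth", abs(u_i @ u_next))
```

On a trained ResNet one would replace the toy blocks with the model's actual residual branches (for example, using automatic differentiation to obtain each Jacobian) and repeat the measurement across depths, which is the kind of comparison RA2 and RA4 describe.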
