Residual Alignment: Uncovering the Mechanisms of Residual Networks

Jianing Li; Vardan Papyan

TL;DR

This work investigates why ResNets perform so well by linearizing residual blocks via Residual Jacobians and measuring their singular value decompositions. The measurements reveal Residual Alignment (RA), a four-part phenomenon (RA1–RA4): intermediate representations of an input are equispaced on a line, top singular vectors of Residual Jacobians align with each other and across depths, the Jacobians have low rank, and their top singular values scale inversely with depth. The authors prove that RA2–RA4 imply RA1 in binary classification, and introduce the Unconstrained Jacobians Model, a mathematical model in which RA provably occurs. Empirically, RA is observed across diverse ResNet variants, depths, and datasets; it co-occurs with Neural Collapse and disappears when skip connections are removed. Counterfactual experiments show how the number of classes and stochastic depth modulate RA. The discussion outlines implications for generalization, a potential extension to Transformers and recurrent architectures, and prospects for model compression and new regularization strategies.

Abstract

The ResNet architecture has been widely adopted in deep learning due to its significant boost to performance through the use of simple skip connections, yet the underlying mechanisms leading to its success remain largely unknown. In this paper, we conduct a thorough empirical study of the ResNet architecture in classification tasks by linearizing its constituent residual blocks using Residual Jacobians and measuring their singular value decompositions. Our measurements reveal a process called Residual Alignment (RA) characterized by four properties: (RA1) intermediate representations of a given input are equispaced on a line, embedded in high dimensional space, as observed by Gai and Zhang [2021]; (RA2) top left and right singular vectors of Residual Jacobians align with each other and across different depths; (RA3) Residual Jacobians are at most rank C for fully-connected ResNets, where C is the number of classes; and (RA4) top singular values of Residual Jacobians scale inversely with depth. RA consistently occurs in models that generalize well, in both fully-connected and convolutional architectures, across various depths and widths, for varying numbers of classes, on all tested benchmark datasets, but ceases to occur once the skip connections are removed. It also provably occurs in a novel mathematical model we propose. This phenomenon reveals a strong alignment between residual branches of a ResNet (RA2+4), imparting a highly rigid geometric structure to the intermediate representations as they progress linearly through the network (RA1) up to the final layer, where they undergo Neural Collapse.
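To make the measurement described above concrete, the following is a minimal sketch rather than the authors' code. It assumes that a Residual Jacobian is the Jacobian of a residual branch with respect to the block input, and it uses a hypothetical bias-free branch F(x) = W2 relu(W1 x); the names `residual_branch`, `residual_jacobian`, `W1`, and `W2` are illustrative and do not come from the paper.

```python
import numpy as np

def residual_branch(x, W1, W2):
    """Toy residual branch F(x) = W2 @ relu(W1 @ x), without biases (assumed form)."""
    return W2 @ np.maximum(W1 @ x, 0.0)

def residual_jacobian(x, W1, W2):
    """Analytic Jacobian dF/dx = W2 @ diag(1[W1 x > 0]) @ W1 at the point x."""
    mask = (W1 @ x > 0).astype(x.dtype)   # ReLU derivative at the pre-activations
    return W2 @ (mask[:, None] * W1)      # scale rows of W1 by the mask

rng = np.random.default_rng(0)
d, hidden = 6, 8                          # hypothetical sizes
W1 = rng.standard_normal((hidden, d))
W2 = rng.standard_normal((d, hidden))
x = rng.standard_normal(d)

J = residual_jacobian(x, W1, W2)
U, S, Vt = np.linalg.svd(J)               # singular values in S are sorted descending

# A bias-free ReLU branch is piecewise linear, so F(x) = J(x) @ x holds exactly.
exact = np.allclose(residual_branch(x, W1, W2), J @ x)
print("linearization exact:", exact)
print("top singular values:", S[:3])
```

In a trained network the same decomposition would be applied to every residual block; for wide convolutional blocks the full Jacobian is large, which is presumably why the paper's Methods include a section on randomized SVD.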

Paper Structure (42 sections, 2 theorems, 31 equations, 53 figures, 1 table)

This paper contains 42 sections, 2 theorems, 31 equations, 53 figures, and 1 table.

Introduction
Background
Problem Statement
Method Overview
Contributions
Results Summary
Methods
Networks
Datasets
Optimization
Randomized SVD
Results
Empirical Results
(RA2+3+4) Imply (RA1)
Unconstrained Jacobians Model Leads to RA
...and 27 more sections

Key Result

Theorem 3.1

Consider binary classification with a pre-activation ResNet. If the Jacobian linearizations are exact and satisfy (RA2+3+4), then (RA1) holds for the intermediate representations.
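The intuition behind this statement can be illustrated numerically. The sketch below is a toy illustration under simplifying assumptions, not the paper's proof: every Residual Jacobian is taken to be exactly rank one and to share a single unit vector u for both its left and right singular vector (J_i = s_i u uᵀ), the singular value is s_i = 1/i (one hypothetical reading of the inverse depth scaling in RA4), and the linearized update is h_{i+1} = h_i + J_i h_i. The function name `simulate` and all numbers are illustrative.

```python
import math

def simulate(h, u, num_blocks):
    """Apply h <- h + s_i * u * (u . h) with s_i = 1/i for i = 1..num_blocks."""
    trajectory = [list(h)]
    for i in range(1, num_blocks + 1):
        s_i = 1.0 / i                                  # assumed inverse-depth scaling
        proj = sum(a * b for a, b in zip(u, h))        # u^T h
        h = [a + s_i * proj * b for a, b in zip(h, u)] # rank-one residual update
        trajectory.append(h)
    return trajectory

u = [0.6, 0.8, 0.0]          # shared unit singular vector (hypothetical)
h1 = [1.0, 2.0, -1.0]        # input to the first block (hypothetical)
trajectory = simulate(h1, u, num_blocks=8)

# Differences between consecutive representations.
steps = [[b - a for a, b in zip(p, q)] for p, q in zip(trajectory, trajectory[1:])]
step_norms = [math.sqrt(sum(c * c for c in s)) for s in steps]
print(step_norms)
```

Under these assumptions uᵀh_i grows linearly in i, so each update s_i (uᵀh_i) u equals the same vector (uᵀh_1) u: the representations move along the direction u in equal steps, which is the equispaced-on-a-line behaviour of (RA1).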

Figures (53)

Figure 1: Visualization of Residual Alignment. Intermediate representations of a ResNet34 trained on CIFAR10 are projected onto two random vectors. The representations of each image are color-coded by its true label and connected to form a trajectory, showing their progression through the network. Note the linear arrangement of the intermediate representations and the equidistant spacing between representations at consecutive layers (RA1). Our work shows that this phenomenon results from the alignment of the top singular vectors of Residual Jacobians (RA2) and the inverse scaling of the top singular values with depth (RA4). Note also that the magnitudes of the class means increase significantly with depth relative to the within-class variability, indicating that the representations undergo layer-wise Neural Collapse \cite{papyan2020traces, galanti2022implicit, he2022law, li2023principled}.
Figure 2: (RA2): Top singular vectors of Residual Jacobians align. Subfigure \ref{fig:c20} and Subfigure \ref{fig:c21} present the alignment of the first 8 blocks and the last 7 blocks, respectively, for a ResNet34 trained on CIFAR100 (Type 3 model in Section \ref{sec:models}), evaluated on a single randomly sampled input. Each subplot $(i,j)$ shows the matrix $U_{j,30}^\top J_i V_{j,30}$, where $U_{j,30}$ and $V_{j,30}$ denote the top-30 left and right singular vectors of the Residual Jacobian $J_j$, respectively, and $i,j$ are the indices of the residual blocks, i.e., their depth. A distinct diagonal line of intense pixels is apparent in almost every subplot, signifying that the top singular vectors of $J_j$ diagonalize $J_i$. In other words, the top singular vectors of $J_i$ and $J_j$ align, and (RA2) holds. The pattern persists when $V_{j,30}^\top J_i U_{j,30}$ is plotted instead of $U_{j,30}^\top J_i V_{j,30}$, further confirming that the top left and right singular vectors align in accordance with (RA2). Additional visualizations of both matrices, across various models and datasets, are available in subsections \ref{subsec:aUJV} and \ref{subsec:aVJU} of the Appendix. Note that no alignment exists between the Jacobians at initialization; the alignment emerges during training.
Figure 3: (RA3): Singular vector alignment occurs in a subspace of rank $\leq C$. The figure presents a sequence of subplots showing the matrix $U_{16,10}^\top J_9 V_{16,10}$. Here, $J_9$ is the $9$-th Residual Jacobian, while $U_{16,10}$ and $V_{16,10}$ are the leading $10$ left and right singular vectors, respectively, of the $16$-th Residual Jacobian, $J_{16}$. The calculations are based on ResNet34 models (Type 1 model in Section \ref{sec:models}) trained on subsets of the CIFAR10 dataset comprising 4, 6, and 8 classes, as well as on the complete CIFAR10 and CIFAR100 datasets. The results are presented in Subfigures \ref{fig:c4}, \ref{fig:c6}, \ref{fig:c8}, \ref{fig:c10}, and \ref{fig:c100}, respectively. As the number of classes increases, the alignment of singular vectors occurs in an increasingly higher-dimensional subspace.
Figure 4: Singular values of Residual Jacobians for a ResNet34 trained on CIFAR10 (Type 1 model in Section \ref{sec:models}). Subfigure \ref{fig:sval} shows the top $20$ singular values of the Residual Jacobians, while Subfigure \ref{fig:val} illustrates the inverse scaling of the top singular value with depth. More singular value plots, from diverse models and datasets, are available in subsection \ref{subsec:asval} of the Appendix.
Figure 5: Stochastic depth amplifies singular vector alignment. A comparison of (RA2) for two Type 3 models trained on CIFAR10 for $50$ epochs: one trained with stochastic depth (each residual block is skipped with probability 0.3 during training) and the other without it.
...and 48 more figures
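The quantity plotted in Figure 2, $U_{j,k}^\top J_i V_{j,k}$, can be sketched as follows. This is an illustrative example on synthetic matrices, not on Jacobians of a trained network: the two matrices are constructed by hand to share their top singular vectors (with coinciding left and right vectors), so the near-diagonal result is built in by assumption. The helper names `alignment_matrix` and `make_jacobian`, the sizes, and the noise level are all hypothetical.

```python
import numpy as np

def alignment_matrix(J_i, J_j, k):
    """Return U_{j,k}^T J_i V_{j,k}, with U, V the top-k singular vectors of J_j."""
    U, _, Vt = np.linalg.svd(J_j)
    return U[:, :k].T @ J_i @ Vt[:k, :].T

def reversed_alignment_matrix(J_i, J_j, k):
    """Return V_{j,k}^T J_i U_{j,k}, the variant with left and right vectors swapped."""
    U, _, Vt = np.linalg.svd(J_j)
    return Vt[:k, :] @ J_i @ U[:, :k]

rng = np.random.default_rng(0)
d, k = 12, 3
Q, _ = np.linalg.qr(rng.standard_normal((d, d)))   # shared orthonormal basis

def make_jacobian(top_values, noise=1e-4):
    """Synthetic low-rank matrix Q diag(s) Q^T plus small noise (stand-in for J)."""
    s = np.zeros(d)
    s[:k] = top_values
    return Q @ np.diag(s) @ Q.T + noise * rng.standard_normal((d, d))

J_a = make_jacobian([3.0, 2.0, 1.0])
J_b = make_jacobian([1.5, 1.0, 0.5])

A = alignment_matrix(J_a, J_b, k)
B = reversed_alignment_matrix(J_a, J_b, k)
off_diag_A = A - np.diag(np.diag(A))
off_diag_B = B - np.diag(np.diag(B))
print(np.round(A, 2))
```

For matrices whose singular vectors are unrelated, the same computation would produce a dense matrix with no visible diagonal, which matches the caption's remark that no alignment exists at initialization.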

Theorems & Definitions (5)

Theorem 3.1
Definition 3.2: Unconstrained Jacobians Model
Theorem 3.3
proof
proof
