Unifying Low Dimensional Observations in Deep Learning Through the Deep Linear Unconstrained Feature Model

Connall Garrod; Jonathan P. Keating

TL;DR

This work addresses why deep networks exhibit strikingly low-dimensional spectral structures by analyzing the deep Unconstrained Feature Model (UFM) and linking these structures to Deep Neural Collapse (DNC). It derives analytic expressions for the Hessian, gradient, and weight spectra in terms of class-mean features, showing that the layer-wise Hessians have a Kronecker structure with $K^2$ outlier eigenvalues and that the full Hessian inherits this structure in deep networks. The results extend from linear to ReLU UFMs and hold under both MSE and CE losses (with adjustments), with empirical validation on synthetic UFMs and on real networks trained on MNIST and CIFAR-10. Altogether, DNC provides a unifying theoretical lens for curvature, gradient alignment, and weight structure, with implications for training dynamics and regularization strategies in overparameterized regimes.

Abstract

Empirical studies have revealed low dimensional structures in the eigenspectra of weights, Hessians, gradients, and feature vectors of deep networks, consistently observed across datasets and architectures in the overparameterized regime. In this work, we analyze deep unconstrained feature models (UFMs) to provide an analytic explanation of how these structures emerge at the layerwise level, including the bulk outlier Hessian spectrum and the alignment of gradient descent with the outlier eigenspace. We show that deep neural collapse underlies these phenomena, deriving explicit expressions for eigenvalues and eigenvectors of many deep learning matrices in terms of class feature means. Furthermore, we demonstrate that the full Hessian inherits its low dimensional structure from the layerwise Hessians, and empirically validate our theory in both UFMs and deep networks.

Paper Structure (37 sections, 18 theorems, 216 equations, 18 figures)

Introduction
Related Works
Background
Low Dimensional Structure in the Deep Linear UFM
Hessian Spectra
Gradient Alignment with Outlier Eigenspace
Weight Matrices
Full Hessian Spectrum
Low Dimensional Structure in the Deep ReLU UFM
Numerical Experiments
Concluding Remarks
Motivating the deep Unconstrained Feature Model
Experimental Evaluations of UFMs
The Theoretical Basis of the UFM
Why the UFM Can Exactly Fit the Data
...and 22 more sections

Key Result

Theorem 1

Consider the deep linear UFM described in eq:deep_UFM. Let the network width satisfy $d \geq K$, and consider a layer $l$ with $1 \leq l < L$. Assume further that the regularization parameter $\lambda$ satisfies the condition in eq:reg_condition. Then, at any global optimum of the loss, the layer-wise [...] As a consequence, $\textrm{Hess}_l$ has rank $K^2$, with nonzero eigenvectors given by $\hat{\mu}_c$ [...]
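
The rank claim can be illustrated with a generic linear-algebra fact: the Kronecker product of two rank-$K$ matrices has rank $K^2$, and Kronecker products of their eigenvectors are eigenvectors of the product. The sketch below is not the Hessian from Theorem 1, whose exact expression is truncated above. It uses random stand-in matrices `M_out` and `M_in` (hypothetical names) in place of class-mean features, with $d \geq K$ as in the theorem.

```python
import numpy as np

def kron_lowrank_demo(d=6, K=3, seed=0):
    """Illustrate rank(A kron B) = K^2 for two rank-K PSD matrices A and B."""
    rng = np.random.default_rng(seed)
    # Assumption: random d x K matrices stand in for class-mean feature matrices.
    M_out = rng.standard_normal((d, K))
    M_in = rng.standard_normal((d, K))
    A = M_out @ M_out.T  # rank K (almost surely), size d x d
    B = M_in @ M_in.T    # rank K (almost surely), size d x d
    H = np.kron(A, B)    # size d^2 x d^2, Kronecker-structured
    rank = int(np.linalg.matrix_rank(H))

    # The Kronecker product of the top eigenvectors of A and B should be an
    # eigenvector of H, with eigenvalue equal to the product of eigenvalues.
    evals_A, evecs_A = np.linalg.eigh(A)
    evals_B, evecs_B = np.linalg.eigh(B)
    w = np.kron(evecs_A[:, -1], evecs_B[:, -1])
    residual = float(np.linalg.norm(H @ w - evals_A[-1] * evals_B[-1] * w))
    return rank, residual

rank, residual = kron_lowrank_demo()
print(rank, residual)
```

With $d = 6$ and $K = 3$ the matrix is $36 \times 36$, yet only $K^2 = 9$ directions carry nonzero eigenvalues, which is the sense in which a Kronecker-structured layer-wise Hessian is low dimensional.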

Figures (18)

Figure 1: Training of a deep linear UFM. Left: Squared cosine similarity between $\mu_{c}^{(l+1)} \otimes \mu_{c'}^{(l)}$ and $\textrm{Hess}_l(\mu_{c}^{(l+1)} \otimes \mu_{c'}^{(l)})$. Middle & Right: Decomposition coefficients of $\tilde{g}^{(l)}$ in terms of the predicted eigenvectors $\mu_c^{(l+1)} \otimes \mu_{c'}^{(l)}$, measured by squared cosine similarity. Middle: $c=c'$, right: $c \neq c'$.
Figure 2: Histograms of the spectrum of $\textrm{Hess}_l$ for a deep linear UFM at an intermediate layer $l$ over a range of training epochs. The top $K^2=16$ outlier eigenvalues are plotted as spikes.
Figure 3: Early stages of training for a layer of a deep ReLU UFM. Left: Proportion of feature vector entries below $-10^{-6}$. Right: Frobenius distance of $\bar{H}_l^T \bar{H}_l$ from $I$ after normalization.
Figure 4: Early stages of training for a deep ReLU UFM. Left: Squared cosine similarity between $\mu_{c}^{(l+1)} \otimes \mu_{c'}^{(l)}$ and $\textrm{Hess}_l (\mu_{c}^{(l+1)} \otimes \mu_{c'}^{(l)})$. Middle & Right: Decomposition coefficients of $\tilde{g}^{(l)}$ in terms of the predicted eigenvectors $\mu_c^{(l+1)} \otimes \mu_{c'}^{(l)}$, measured by squared cosine similarity. The middle panel corresponds to $c=c'$, and the right panel to $c \neq c'$.
Figure 5: Histograms of the spectrum of $\textrm{Hess}_l$ for a deep ReLU UFM at an intermediate layer $l$ over a range of training epochs. The top $K^2=16$ outlier eigenvalues are plotted as spikes.
...and 13 more figures
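
The squared cosine similarity used in the captions of Figures 1 and 4 can be read as an eigenvector test: a vector $v$ is an eigenvector of a matrix $H$ with nonzero eigenvalue exactly when $v$ and $Hv$ are collinear, in which case the similarity equals 1. The following is a minimal sketch of that metric under stated assumptions. The matrix `H` is a Kronecker-structured stand-in, not the paper's Hessian, and the orthogonal columns of `mu_out` and `mu_in` are hypothetical placeholders for the class means $\mu_c^{(l+1)}$ and $\mu_{c'}^{(l)}$.

```python
import numpy as np

def squared_cosine(a, b):
    """Squared cosine similarity; equals 1 exactly when a and b are collinear."""
    return float(np.dot(a, b) ** 2 / (np.dot(a, a) * np.dot(b, b)))

rng = np.random.default_rng(0)
d, K = 5, 4

# Assumption: orthogonal columns with distinct norms play the role of class means.
Q_out, _ = np.linalg.qr(rng.standard_normal((d, K)))
Q_in, _ = np.linalg.qr(rng.standard_normal((d, K)))
scales = np.arange(1.0, K + 1.0)
mu_out = Q_out * scales  # column c has norm scales[c]
mu_in = Q_in * scales

# Stand-in for a Kronecker-structured layer-wise Hessian.
H = np.kron(mu_out @ mu_out.T, mu_in @ mu_in.T)

# Candidate eigenvector built from one class pair (c, c').
c, c_prime = 0, 2
candidate = np.kron(mu_out[:, c], mu_in[:, c_prime])
aligned = squared_cosine(candidate, H @ candidate)

# A generic vector is not an eigenvector, so its score falls below 1.
random_vec = rng.standard_normal(d * d)
unaligned = squared_cosine(random_vec, H @ random_vec)
print(aligned, unaligned)
```

A score near 1 for every class pair is the signature the figures track during training. The same function can be applied to a gradient and a candidate eigenvector to obtain decomposition coefficients of the kind shown in the middle and right panels.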

Theorems & Definitions (21)

Definition 1: Deep Neural Collapse
Theorem 1
Theorem 2
Theorem 3
Theorem 4
Theorem 5
Theorem 6
Definition 2: DNC Structure in the Deep ReLU UFM
Theorem 7
Theorem 8
...and 11 more
