Average gradient outer product as a mechanism for deep neural collapse

Daniel Beaglehole; Peter Súkeník; Marco Mondelli; Mikhail Belkin

TL;DR

The paper introduces the Average Gradient Outer Product (AGOP) as a data-dependent mechanism for Deep Neural Collapse (DNC) and implements it via Deep Recursive Feature Machines (Deep RFM). It provides extensive empirical evidence that projecting representations onto the layerwise AGOP induces NC1 and NC2 across vision datasets, and develops asymptotic and kernel-learning theories showing exponential convergence toward collapse and an implicit bias toward a collapsed kernel. The authors also connect this mechanism to standard networks, showing that the right singular structure of the weights aligns with the AGOP and largely drives within-class variability collapse under small initialization. Overall, the work links data-dependent feature learning to DNC through AGOP projections, offering a unified explanation that integrates kernel learning, weight structure, and layer-wise learning dynamics.

Abstract

Deep Neural Collapse (DNC) refers to the surprisingly rigid structure of the data representations in the final layers of Deep Neural Networks (DNNs). Though the phenomenon has been measured in a variety of settings, its emergence is typically explained via data-agnostic approaches, such as the unconstrained features model. In this work, we introduce a data-dependent setting where DNC forms due to feature learning through the average gradient outer product (AGOP). The AGOP is defined with respect to a learned predictor and is equal to the uncentered covariance matrix of its input-output gradients averaged over the training dataset. The Deep Recursive Feature Machine (Deep RFM) is a method that constructs a neural network by iteratively mapping the data with the AGOP and applying an untrained random feature map. We demonstrate empirically that DNC occurs in Deep RFM across standard settings as a consequence of the projection with the AGOP matrix computed at each layer. Further, we theoretically explain DNC in Deep RFM in an asymptotic setting and as a result of kernel learning. We then provide evidence that this mechanism holds for neural networks more generally. In particular, we show that the right singular vectors and values of the weights can be responsible for the majority of within-class variability collapse for DNNs trained in the feature learning regime. As observed in recent work, this singular structure is highly correlated with that of the AGOP.
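The AGOP definition in the abstract can be made concrete with a small numerical sketch. Writing $J_f(x) \in \mathbb{R}^{K \times d}$ for the input-output Jacobian of a predictor $f$, the uncentered gradient covariance averaged over $N$ training points is $\frac{1}{N}\sum_{i=1}^{N} J_f(x_i)^\top J_f(x_i)$. The function name `agop`, the finite-difference step `eps`, and the toy linear predictor below are illustrative assumptions, not the paper's implementation (Deep RFM uses a learned kernel predictor); for a linear map $f(x) = Wx$ the result should reduce to $W^\top W$, which gives a simple sanity check.

```python
import numpy as np

def agop(f, X, eps=1e-5):
    """Estimate the AGOP of predictor f over the rows of X.

    Gradients are approximated with central finite differences
    (an assumption of this sketch; analytic gradients would normally be used).
    """
    N, d = X.shape
    M = np.zeros((d, d))
    for x in X:
        # Jacobian J has shape (K, d): column j approximates df/dx_j.
        J = np.stack(
            [(f(x + eps * e) - f(x - eps * e)) / (2 * eps) for e in np.eye(d)],
            axis=1,
        )
        # Accumulate the uncentered covariance of the input-output gradients.
        M += J.T @ J
    return M / N

# Toy check with a hypothetical linear predictor f(x) = W x.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 5))   # K = 3 outputs, d = 5 inputs
X = rng.normal(size=(20, 5))  # N = 20 training points
M_linear = agop(lambda x: W @ x, X)
```

In this toy case `M_linear` matches `W.T @ W` up to rounding, since the Jacobian of a linear map does not depend on the input point.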

Paper Structure (23 sections, 7 theorems, 34 equations, 12 figures, 1 algorithm)

Introduction
Related work
Neural collapse (NC).
AGOP feature learning.
Background and definitions
Notation
Average gradient outer product
Deep RFM
Deep Neural Collapse
Average gradient outer product induces DNC in Deep RFM
Theoretical explanations for DNC in Deep RFM
Asymptotic analysis
Connection to parametrized kernel ridge regression
Within-class variability collapse through AGOP in neural networks
Conclusion
...and 8 more sections

Key Result

Theorem 5.3

Suppose we apply Deep RFM on any dataset $X$ with labels $Y \in \mathbb{R}^{N \times K}$ choosing all $\{\Phi_l\}_l$ and $\{k_l\}_l$ as the feature map $\Phi_{\mathrm{map}}$ and kernel $\widehat{k}$ above, with no ridge parameter ($\gamma = 0$). Then, there exists a universal constant $C>0$, such th

Figures (12)

Figure 1: Neural collapse with Deep RFM on (A) CIFAR-10 and (B) MNIST. The matrix of inner products of all pairs of points in $X_{l}$ extracted from layers $l \in \{1, 3, 7, 13, 19\}$ of Deep RFM. The columns show the Gram matrices of feature vectors transformed by the AGOP from Deep RFM, $\left(\widetilde{X}_l - \widetilde{\mu}_G^l\right) \odiv \|\widetilde{X}_l - \widetilde{\mu}_G^l\|$. The data are ordered so that points of the same class are adjacent to one another, arranged from classes $1$ to $10$. Deep RFM uses non-linearity $\sigma(\cdot)=\cos(\cdot)$ in (A) and $\sigma(\cdot) = \mathop{\mathrm{ReLU}}\nolimits(\cdot)$ in (B).
Figure 2: Feature variability collapse from different singular value decomposition components in (A) an MLP on MNIST, and (B) a ResNet on CIFAR-10. We measure the reduction in the NC1 metric throughout training at each of five fully-connected layers. Each layer is decomposed into its input, $\Phi(X)$, the projection onto the right singular space of $W$, $S V^\top \Phi(X)$, and then $U$, the left singular vectors of $W$, and the application of the non-linearity.
Figure 3: Neural collapse with Deep RFM on additional datasets with $\sigma(\cdot) = \mathop{\mathrm{ReLU}}\nolimits(\cdot)$. We show $\mathop{\mathrm{tr}}\nolimits{\Sigma_W}/\mathop{\mathrm{tr}}\nolimits{\Sigma_B}$, our NC1 metric on the left, and $\left\|\tilde{\mu}\tilde{\mu}^\top - \Sigma_{\mathrm{ETF}} \right\|$, our NC2 metric, on the right. The first row is CIFAR-10, second is MNIST, third is SVHN. We plot these metrics as a function of depth of Deep RFM for the original data $X$ (green), the data after applying the square root of the AGOP $M_l^{1/2} x$ (orange), and the data after the AGOP and non-linearity (blue).
Figure 4: Neural collapse with Deep RFM on additional datasets with $\sigma(\cdot) = \cos(\cdot)$. We show $\mathop{\mathrm{tr}}\nolimits{\Sigma_W}/\mathop{\mathrm{tr}}\nolimits{\Sigma_B}$, our NC1 metric on the left, and $\left\|\tilde{\mu}\tilde{\mu}^\top - \Sigma_{\mathrm{ETF}} \right\|$, our NC2 metric, on the right. The first row is CIFAR-10, second is MNIST, third is SVHN. We plot these metrics as a function of depth of Deep RFM for the original data $X$ (green), the data after applying the square root of the AGOP $M_l^{1/2} x$ (orange), and the data after the AGOP and non-linearity (blue).
Figure 5: Visualization of neural collapse for Deep RFM on additional datasets with $\sigma(\cdot) = \cos(\cdot)$. As in the main text, we plot the Gram matrix of the centered and normalized feature vectors $\widetilde{X}_l$. We see the data form the ETF in the final column. (A) corresponds to CIFAR-10, (B) SVHN, and (C) MNIST.
...and 7 more figures
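
The NC1 metric reported in Figures 3 and 4, $\mathrm{tr}\,\Sigma_W / \mathrm{tr}\,\Sigma_B$, compares within-class to between-class variability. A minimal sketch follows, assuming $\Sigma_W$ is the within-class covariance averaged over all points and $\Sigma_B$ is the covariance of the class means around the global mean, averaged over classes; the paper's exact normalization may differ, and the function name `nc1_metric` and the toy data are hypothetical.

```python
import numpy as np

def nc1_metric(X, y):
    """Return tr(Sigma_W) / tr(Sigma_B) for features X (N x d) and labels y (N,).

    Normalization conventions here are an assumption of this sketch.
    """
    classes = np.unique(y)
    mu_G = X.mean(axis=0)  # global mean
    tr_W = 0.0
    tr_B = 0.0
    for c in classes:
        Xc = X[y == c]
        mu_c = Xc.mean(axis=0)  # class mean
        # Trace of the within-class scatter, averaged over all N points.
        tr_W += np.sum((Xc - mu_c) ** 2) / len(X)
        # Trace of the between-class scatter, averaged over the classes.
        tr_B += np.sum((mu_c - mu_G) ** 2) / len(classes)
    return tr_W / tr_B

# Toy data: three classes whose means are the standard basis vectors.
rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 4)
means = np.eye(3)
nc1_collapsed = nc1_metric(means[y], y)  # every point sits on its class mean
nc1_noisy = nc1_metric(means[y] + 0.1 * rng.normal(size=(12, 3)), y)
```

A value of zero corresponds to full within-class variability collapse, so `nc1_collapsed` is zero while `nc1_noisy` is strictly positive.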

Theorems & Definitions (12)

Definition 3.2
Theorem 5.3: Deep RFM exhibits neural collapse
Theorem 5.4
Proposition A.0
Proposition B.1
proof : Proof of Proposition \ref{prop: relaxation tight, parametrized krr}
Lemma C.1: Woodbury Inverse Formula
Lemma C.2: Fixed point of collapse
proof : Proof of Lemma \ref{lemma: A-star inverse}
proof : Proof of Proposition \ref{prop:DeepRFM_NC}
...and 2 more
