Average gradient outer product as a mechanism for deep neural collapse
Daniel Beaglehole, Peter Súkeník, Marco Mondelli, Mikhail Belkin
TL;DR
The paper introduces the Average Gradient Outer Product ($AGOP$) as a data-dependent mechanism for Deep Neural Collapse (DNC) and implements it via Deep Recursive Feature Machines (Deep RFM). It provides extensive empirical evidence that projecting representations onto the layerwise $AGOP$ induces NC1 and NC2 across vision datasets, and develops asymptotic and kernel-learning theories showing exponential convergence toward collapse and an implicit bias toward a collapsed kernel. The authors also connect this mechanism to standard networks, showing that the right singular structure of weights aligns with the $AGOP$ and largely drives within-class variability collapse under small initialization. Overall, the work links data-dependent feature learning to DNC through $AGOP$ projections, offering a unified explanation that integrates kernel learning, weight structure, and layer-wise learning dynamics.
Abstract
Deep Neural Collapse (DNC) refers to the surprisingly rigid structure of the data representations in the final layers of Deep Neural Networks (DNNs). Though the phenomenon has been measured in a variety of settings, its emergence is typically explained via data-agnostic approaches, such as the unconstrained features model. In this work, we introduce a data-dependent setting where DNC forms due to feature learning through the average gradient outer product (AGOP). The AGOP is defined with respect to a learned predictor and is equal to the uncentered covariance matrix of its input-output gradients averaged over the training dataset. The Deep Recursive Feature Machine (Deep RFM) is a method that constructs a neural network by iteratively mapping the data with the AGOP and applying an untrained random feature map. We demonstrate empirically that DNC occurs in Deep RFM across standard settings as a consequence of the projection with the AGOP matrix computed at each layer. Further, we theoretically explain DNC in Deep RFM in an asymptotic setting and as a result of kernel learning. We then provide evidence that this mechanism holds for neural networks more generally. In particular, we show that the right singular vectors and values of the weights can be responsible for the majority of within-class variability collapse for DNNs trained in the feature learning regime. As observed in recent work, this singular structure is highly correlated with that of the AGOP.
