Provable Emergence of Deep Neural Collapse and Low-Rank Bias in $L^2$-Regularized Nonlinear Networks
Emanuele Zangrando, Piero Deidda, Simone Brugiapaglia, Nicola Guglielmi, Francesco Tudisco
TL;DR
The paper explains why deep neural collapse (DNC1) and low-rank bias emerge in nonlinear networks trained with $L^2$ weight decay. It develops a unified framework that connects DNC1 to implicit low-rank structure through the Total Cluster Variation (TCV) of intermediate features, and derives a rank–TCV bound that ties the rank of each layer's weights to how tightly the data cluster. It proves that zero TCV across intermediate layers is globally optimal under natural architectural constraints, for both feedforward and residual networks. It also shows that the landscape is benign: from almost every interpolating initialization there exists a continuous, loss-decreasing path to a DNC1-satisfying configuration. Numerical experiments corroborate the theory: the distance of the weight matrices from the set of rank-$K$ matrices shrinks with the TCV, and DNC1 configurations are reachable and stable under small weight decay. Together, these results offer a theoretical explanation for the universality of neural collapse and the low-rank bias in deep networks.
Abstract
We present a unified theoretical framework connecting the first property of Deep Neural Collapse (DNC1) to the emergence of implicit low-rank bias in nonlinear networks trained with $L^2$ weight decay regularization. Our main contributions are threefold. First, we derive a quantitative relation between the Total Cluster Variation (TCV) of intermediate embeddings and the numerical rank of stationary weight matrices. In particular, we establish that, at any critical point, the distance from a weight matrix to the set of rank-$K$ matrices is bounded by a constant times the TCV of earlier-layer features, scaled inversely with the weight-decay parameter. Second, we prove global optimality of DNC1 in a constrained representation-cost setting for both feedforward and residual architectures, showing that zero TCV across intermediate layers minimizes the representation cost under natural architectural constraints. Third, we establish a benign landscape property: for almost every interpolating initialization there exists a continuous, loss-decreasing path from the initialization to a globally optimal, DNC1-satisfying configuration. Our theoretical claims are validated empirically; numerical experiments confirm the predicted relations among TCV, singular-value structure, and weight decay. These results indicate that neural collapse and low-rank bias are intimately linked phenomena arising from the optimization geometry induced by weight decay.
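The two quantities related by the first contribution can be measured directly on a trained network. The NumPy sketch below is illustrative only and is not the authors' implementation. It assumes that the TCV of a layer is the average squared distance of each feature vector to its class mean; the abstract does not state the definition, so the normalization in particular is an assumption. The distance to the set of rank-$K$ matrices is taken in the Frobenius norm, where by the Eckart–Young theorem it equals the norm of the singular values beyond the $K$-th. The function names `total_cluster_variation` and `distance_to_rank_k` are hypothetical.

```python
import numpy as np


def total_cluster_variation(features, labels):
    """Average squared distance of each feature vector to its class mean.

    ASSUMPTION: this within-class scatter is used as a stand-in for the
    paper's TCV; the exact definition and normalization may differ.
    features: array of shape (n_samples, dim); labels: shape (n_samples,).
    """
    features = np.asarray(features, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for c in np.unique(labels):
        cls = features[labels == c]          # all samples of class c
        total += np.sum((cls - cls.mean(axis=0)) ** 2)
    return total / len(features)


def distance_to_rank_k(weight, k):
    """Frobenius distance from `weight` to the nearest matrix of rank <= k.

    By the Eckart-Young theorem this is the l2 norm of the singular
    values beyond the k-th largest.
    """
    s = np.linalg.svd(np.asarray(weight, dtype=float), compute_uv=False)
    return float(np.sqrt(np.sum(s[k:] ** 2)))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    K, n_per_class, dim = 3, 50, 8           # hypothetical sizes

    # Features that have collapsed onto K class means, plus small noise.
    means = rng.normal(size=(K, dim))
    labels = np.repeat(np.arange(K), n_per_class)
    features = means[labels] + 0.01 * rng.normal(size=(K * n_per_class, dim))

    # A weight matrix that is rank K up to a small perturbation.
    weight = rng.normal(size=(dim, K)) @ rng.normal(size=(K, dim))
    weight += 0.01 * rng.normal(size=(dim, dim))

    print("TCV (assumed definition):", total_cluster_variation(features, labels))
    print("distance to rank-K set:  ", distance_to_rank_k(weight, K))
```

Under these assumptions, one could evaluate both functions layer by layer at a stationary point and compare the rank distance of each weight matrix against the TCV of the earlier-layer features divided by the weight-decay parameter, which is the scaling the stated bound describes.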
