Provable Emergence of Deep Neural Collapse and Low-Rank Bias in $L^2$-Regularized Nonlinear Networks

Emanuele Zangrando, Piero Deidda, Simone Brugiapaglia, Nicola Guglielmi, Francesco Tudisco

TL;DR

The paper addresses why deep neural collapse (DNC1) and low-rank bias emerge in nonlinear networks trained with $L^2$ weight decay. It develops a unified framework connecting DNC1 to implicit low-rank structure via the Total Cluster Variation (TCV) of intermediate features, deriving a rank–TCV bound that ties the rank of each layer's weights to how tightly the data cluster. It proves that zero TCV across intermediate layers is globally optimal under natural architectural constraints for both feedforward and residual networks, and shows a benign landscape: from almost any interpolating initialization there exists a loss-decreasing path to a DNC1-satisfying configuration. Experiments corroborate the theory: the distance of trained weight matrices from rank $K$ shrinks in proportion to TCV, and DNC1 configurations are reachable and stable under small weight decay. Together, these results offer a theoretical explanation for the universality of neural collapse and the low-rank bias in deep networks.

Abstract

We present a unified theoretical framework connecting the first property of Deep Neural Collapse (DNC1) to the emergence of implicit low-rank bias in nonlinear networks trained with $L^2$ weight decay regularization. Our main contributions are threefold. First, we derive a quantitative relation between the Total Cluster Variation (TCV) of intermediate embeddings and the numerical rank of stationary weight matrices. In particular, we establish that, at any critical point, the distance from a weight matrix to the set of rank-$K$ matrices is bounded by a constant times the TCV of earlier-layer features, scaled inversely with the weight-decay parameter. Second, we prove global optimality of DNC1 in a constrained representation-cost setting for both feedforward and residual architectures, showing that zero TCV across intermediate layers minimizes the representation cost under natural architectural constraints. Third, we establish a benign landscape property: for almost every interpolating initialization there exists a continuous, loss-decreasing path from the initialization to a globally optimal, DNC1-satisfying configuration. Our theoretical claims are validated empirically; numerical experiments confirm the predicted relations among TCV, singular-value structure, and weight decay. These results indicate that neural collapse and low-rank bias are intimately linked phenomena arising from the optimization geometry induced by weight decay.
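
As a rough illustration of the quantity the abstract calls TCV, the sketch below measures within-class spread of a feature matrix. The paper's Definition 3.2 is not reproduced on this page, so the formula used here (sum of squared distances of each feature vector from its class centroid) is an assumption, and the names `total_cluster_variation`, `features`, and `labels` are hypothetical.

```python
import numpy as np


def total_cluster_variation(features: np.ndarray, labels: np.ndarray) -> float:
    """Within-class spread of `features` (shape: n_samples x dim).

    ASSUMPTION: TCV is taken to be the sum of squared distances of each
    sample from its class centroid; the paper's exact normalization may differ.
    """
    total = 0.0
    for c in np.unique(labels):
        cluster = features[labels == c]      # all samples of class c
        centroid = cluster.mean(axis=0)      # class mean
        total += float(np.sum((cluster - centroid) ** 2))
    return total


labels = np.array([0, 0, 1, 1])

# Fully collapsed features: every sample sits exactly on its class centroid.
collapsed = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

# Perturbed features: same centroids on average, but with within-class spread.
rng = np.random.default_rng(0)
spread = collapsed + 0.1 * rng.standard_normal(collapsed.shape)

tcv_collapsed = total_cluster_variation(collapsed, labels)
tcv_spread = total_cluster_variation(spread, labels)
```

Under this assumed definition, features that satisfy DNC1 at a given layer have zero TCV, while any within-class spread makes the value strictly positive.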

Paper Structure (30 sections, 18 theorems, 93 equations, 6 figures)

Key Result

Proposition 3.1

Let $f(\Theta,x)$ be a feedforward neural network as in equation DEF_Neural_Network and $\ell:\mathbb{R}^c \times \mathbb{R}^c \to \mathbb{R}$ be a differentiable loss function as in equation Def_Loss_function. Then, for any $\{(x_i,y_i)\}_{i=1}^K\subseteq \mathbb

Figures (6)

  • Figure 1: For each pair $(\lambda,\sigma)$ we report the relative distance $\sum_{j>K}s_j^2/\sum_{j}s_j^2$ of trained weight matrices from the closest rank-$K$ matrix, with $K = 10$. Here, $s_j$ are the singular values of each weight matrix.
  • Figure 2: Singular values of trained networks (solid line) against those predicted by \ref{thm:main_DNC_optimal_ff} (dots) for different values of $\delta = \frac{1}{\lambda} \in \{10,2,1,0.1,0.01,0\}$. The singular values converge to the predicted ones as $\delta \to 0$, as shown by \ref{prop:main_gammaconv_firstlayer} in combination with \ref{thm:main_DNC_optimal_ff}.
  • Figure 3: First three plots from the left: averages of several quantities over all intermediate layers and over $100$ random initializations. Shaded regions span the minimal and maximal observed values. Each optimization step represents 100 epochs. Right: TCV of every layer during training.
  • Figure 4: Norm of the intermediate weight matrices of an $L = 6$ layer ResNet during training. As discussed in \ref{remark:resnet_stability}, we observe convergence to the structured minima described in \ref{thm:NC1_optimal_resnet} for $\delta = \frac{1}{\lambda}$ (as in \ref{eq:loss_regularized_on_1st_layer}).
  • Figure 5: Emergence of the block structure in $Z^{l,\top}Z^l$ characterizing $\mathrm{DNC1}$ during training. Layers for each epoch are ordered from left to right and from top to bottom.
  • ...and 1 more figure
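
The relative distance reported in Figure 1, $\sum_{j>K}s_j^2/\sum_{j}s_j^2$, can be computed directly from the singular values of a weight matrix. The following is a minimal sketch of that formula only; the function name `relative_rank_k_distance` and the random test matrices are hypothetical and are not taken from the paper's experiments.

```python
import numpy as np


def relative_rank_k_distance(weight: np.ndarray, k: int) -> float:
    """Return sum_{j>k} s_j^2 / sum_j s_j^2 for the singular values s_j of `weight`.

    By the Eckart-Young theorem this is the squared Frobenius distance to the
    closest matrix of rank at most k, relative to the squared Frobenius norm.
    """
    s = np.linalg.svd(weight, compute_uv=False)  # singular values, descending
    energy = float(np.sum(s ** 2))
    if energy == 0.0:
        return 0.0  # the zero matrix already has rank <= k
    return float(np.sum(s[k:] ** 2) / energy)


rng = np.random.default_rng(0)
K = 10  # rank used in Figure 1

# Hypothetical examples: an exactly rank-K matrix and a generic full-rank one.
low_rank = rng.standard_normal((64, K)) @ rng.standard_normal((K, 64))
full_rank = rng.standard_normal((64, 64))

dist_low = relative_rank_k_distance(low_rank, K)
dist_full = relative_rank_k_distance(full_rank, K)
```

The value lies in $[0,1]$: it is numerically zero for a matrix of rank at most $K$ and grows as more of the spectrum's energy sits beyond the first $K$ singular values.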

Theorems & Definitions (46)

  • Proposition 3.1: Small batches yield low-rank gradients
  • Definition 3.2: Total Cluster Variation
  • Definition 3.3: Deep Neural Collapse 1
  • Lemma 3.4: Centroid-based gradient approximation of full gradient
  • proof : Proof (sketch)
  • Theorem 3.5: Small within-class variability yields low-rank bias
  • proof : Proof (sketch)
  • Definition 4.1: Constraints
  • Theorem 4.2: $\mathrm{DNC1}$ is optimal for constrained representation cost
  • proof : Proof (sketch)
  • ...and 36 more
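
The title of Proposition 3.1 (small batches yield low-rank gradients) can be illustrated in the simplest possible setting. The sketch below is not the paper's statement, which is truncated above and concerns a general feedforward network; it assumes a single linear layer with a squared loss, where the gradient with respect to the weights is a sum of one outer product per sample and therefore has rank at most the batch size. All variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, batch = 32, 24, 5  # batch size smaller than both layer dimensions

W = rng.standard_normal((d_out, d_in))   # weights of one linear layer (assumed setting)
X = rng.standard_normal((d_in, batch))   # inputs, one column per sample
Y = rng.standard_normal((d_out, batch))  # targets, one column per sample

# Loss: (1 / (2 * batch)) * ||W X - Y||_F^2.
# Gradient w.r.t. W: (W X - Y) X^T / batch, a sum of `batch` rank-one terms.
residual = W @ X - Y
grad_W = residual @ X.T / batch

grad_rank = int(np.linalg.matrix_rank(grad_W))
```

Here `grad_W` is a $24 \times 32$ matrix, yet its rank cannot exceed the batch size of 5, because it factors through the `batch`-dimensional residual.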