The Persistence of Neural Collapse Despite Low-Rank Bias

Connall Garrod, Jonathan P. Keating

TL;DR

The work investigates why neural collapse (NC) and its deep variant (DNC) arise in trained classifiers, focusing on deep unconstrained feature models (UFMs) under cross-entropy loss. It provides a global analysis showing that high-rank DNC is not generally optimal as network depth grows, exposing a low-rank bias that constrains the singular values of the optimal output $Z$ and shapes the loss landscape. The study proves that, for deep linear UFMs, global minima favor diagonally superior, low-rank structures, and that DNC can persist as a local minimum or as a critical point with vanishing gradient or positive semidefinite (PSD) Hessian when regularization is small. The results extend to deep ReLU UFMs under reasonable assumptions, confirming that the low-rank bias persists across nonlinearities and giving a theoretical basis for the empirical observation that DNC often appears despite being suboptimal. Overall, the paper offers the first comprehensive theoretical framework linking low-rank bias to the prevalence of DNC, with implications for how optimization dynamics and architecture influence feature- and weight-space geometry in deep networks.

Abstract

Neural collapse (NC) and its multi-layer variant, deep neural collapse (DNC), describe a structured geometry that occurs in the features and weights of trained deep networks. Recent theoretical work by Sukenik et al. using a deep unconstrained feature model (UFM) suggests that DNC is suboptimal under mean squared error (MSE) loss. They heuristically argue that this is due to low-rank bias induced by L2 regularization. In this work, we extend this result to deep UFMs trained with cross-entropy loss, showing that high-rank structures, including DNC, are not generally optimal. We characterize the associated low-rank bias, proving a fixed bound on the number of non-negligible singular values at global minima as network depth increases. We further analyze the loss surface, demonstrating that DNC is more prevalent in the landscape than other critical configurations, which we argue explains its frequent empirical appearance. Our results are validated through experiments in deep UFMs and deep neural networks.
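
To make the setup concrete, the following is a minimal NumPy sketch of a deep linear UFM objective with cross-entropy loss and L2 regularization, together with a numerical-rank check on the mean logit matrix. The exact form of the objective (a single penalty `lam` applied to every weight matrix and to the free features `H1`), the function names, the tolerance `tol`, and the toy sizes are assumptions made for illustration. They are not the authors' code or experimental configuration.

```python
import numpy as np

def deep_linear_ufm_loss(weights, H1, labels, lam):
    """Cross-entropy plus L2 penalty for a deep linear UFM (assumed form).

    weights: list [W_1, ..., W_L]; the logits are Z = W_L ... W_1 H1.
    H1:      free feature matrix of shape (d, N), one column per sample.
    labels:  integer class label for each of the N columns.
    """
    Z = H1
    for W in weights:  # apply W_1 first, W_L last
        Z = W @ Z
    # Numerically stable log-softmax over the class axis (rows of Z).
    Z_shift = Z - Z.max(axis=0, keepdims=True)
    log_probs = Z_shift - np.log(np.exp(Z_shift).sum(axis=0, keepdims=True))
    ce = -log_probs[labels, np.arange(Z.shape[1])].mean()
    sq_norms = sum(float((W ** 2).sum()) for W in weights) + float((H1 ** 2).sum())
    return ce + 0.5 * lam * sq_norms, Z

def numerical_rank(Z, labels, K, tol=1e-3):
    """Count singular values of the K x K class-mean logit matrix above tol * largest."""
    M = np.stack([Z[:, labels == k].mean(axis=1) for k in range(K)], axis=1)
    s = np.linalg.svd(M, compute_uv=False)
    return int((s > tol * s.max()).sum())

# Toy sizes (hypothetical, chosen only to keep the example small).
K, n, d = 4, 3, 8
labels = np.repeat(np.arange(K), n)

# All-zero parameters give uniform predictions, so the loss equals log(K).
weights0 = [np.zeros((d, d)), np.zeros((K, d))]
loss0, _ = deep_linear_ufm_loss(weights0, np.zeros((d, K * n)), labels, lam=2.0 ** -10)

# Logits whose class means form a simplex ETF, the geometry associated with NC.
etf = np.eye(K) - np.full((K, K), 1.0 / K)
Z_etf = np.repeat(etf, n, axis=1)  # each class mean repeated for its n samples
rank_etf = numerical_rank(Z_etf, labels, K)

print(loss0, rank_etf)
```

In this sketch the simplex-ETF logits have numerical rank `K - 1`, while a low-rank solution of the kind the abstract describes would leave fewer singular values above the tolerance.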

Paper Structure

This paper contains 28 sections, 274 equations, 8 figures.

Figures (8)

  • Figure 1: Experiments in the deep linear UFM: Left: Loss curves for a solution that converges to DNC versus one that converges to the low-rank structure described in Equation \ref{eq:low_rank_sol}. Middle/Right: Corresponding mean logit matrices at convergence. Hyperparameters: $L = 2$, $d = 70$, $\lambda = 2^{-10}$, $K = 10$, $n = 5$, learning rate $= 0.5$.
  • Figure 2: Experiments in the deep linear UFM: Left: Empirical probability of DNC versus width $d$. Right: Empirical probability of DNC versus regularization $\lambda$. Averaged over 10 runs; same hyperparameters as Figure 1.
  • Figure 3: Experiments using UFM-style regularization: Top: Losses of low-rank solutions on MNIST and CIFAR-10 using linear layers in the fully connected head, along with mean logit matrix for CIFAR-10. Hyperparameters: $L = 3$, $d = 64$, $\lambda_W = 5 \times 10^{-3}$, $\lambda_H = 10^{-6}$, learning rate = $0.05$. Bottom: Loss and singular values across layers for CIFAR-10 using ReLU in the fully connected head. Same hyperparameters as above, except $L = 4$.
  • Figure 4: Experiments with standard regularization on CIFAR-10: Left: Singular values of each feature matrix in the fully connected head. Right: Mean logit matrix. Hyperparameters: $L = 3$, $d = 64$, $\lambda = 10^{-2}$, learning rate = $10^{-3}$.
  • Figure 5: Experiments in the deep linear UFM: Left: Rank of converged solution versus width $d$. Right: Rank of converged solution versus regularization $\lambda$. Averaged over 10 runs; same hyperparameters as Figure 1.
  • ...and 3 more figures