Neural Collapse Beyond the Unconstrained Features Model: Landscape, Dynamics, and Generalization in the Mean-Field Regime

Diyuan Wu, Marco Mondelli

TL;DR

This paper advances the theoretical understanding of Neural Collapse by analyzing NC1 in a data-dependent three-layer network within the mean-field regime, rather than the traditional unconstrained-features model. It establishes that NC1 emerges for approximately stationary points with small empirical loss and small gradient norm, and proves convergence of gradient flow to NC1 solutions; under well-separated data, NC1 and vanishing test error co-occur. A two-stage training framework and a Gibbs-form minimizer are introduced to study generalization, with concrete bounds showing NC1 alongside low test error for linearly separable data. The results connect representation geometry to loss landscape and training dynamics, offering new insights into gradient-based optimization and generalization in deep networks. Practical implications include a data-aware explanation for NC1’s prevalence during training and its compatibility with good generalization in certain data regimes.

Abstract

Neural Collapse is a phenomenon where the last-layer representations of a well-trained neural network converge to a highly structured geometry. In this paper, we focus on its first (and most basic) property, known as NC1: the within-class variability vanishes. While prior theoretical studies establish the occurrence of NC1 via the data-agnostic unconstrained features model, our work adopts a data-specific perspective, analyzing NC1 in a three-layer neural network, with the first two layers operating in the mean-field regime and followed by a linear layer. In particular, we establish a fundamental connection between NC1 and the loss landscape: we prove that points with small empirical loss and gradient norm (thus, close to being stationary) approximately satisfy NC1, and the closeness to NC1 is controlled by the residual loss and gradient norm. We then show that (i) gradient flow on the mean squared error converges to NC1 solutions with small empirical loss, and (ii) for well-separated data distributions, both NC1 and vanishing test loss are achieved simultaneously. This aligns with the empirical observation that NC1 emerges during training while models attain near-zero test error. Overall, our results demonstrate that NC1 arises from gradient training due to the properties of the loss landscape, and they show the co-occurrence of NC1 and small test error for certain data distributions.
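
As a rough illustration of what "the within-class variability vanishes" means, the sketch below computes one common proxy for NC1: the trace of the within-class scatter of the last-layer features divided by the trace of the between-class scatter. This particular ratio, the function name `nc1_ratio`, and the toy feature vectors are assumptions made for illustration; the abstract does not state which NC1 metric the paper uses.

```python
from collections import defaultdict


def nc1_ratio(features, labels):
    """Trace of within-class scatter divided by trace of between-class scatter.

    Hypothetical helper: an illustrative NC1 proxy, not necessarily the
    metric used in the paper. Assumes at least two distinct class means,
    otherwise the between-class scatter is zero and the ratio is undefined.
    """
    n = len(features)
    dim = len(features[0])

    # Group feature vectors by class label.
    by_class = defaultdict(list)
    for h, y in zip(features, labels):
        by_class[y].append(h)

    # Mean of all features, used for the between-class term.
    global_mean = [sum(h[j] for h in features) / n for j in range(dim)]

    within = 0.0
    between = 0.0
    for rows in by_class.values():
        mean = [sum(h[j] for h in rows) / len(rows) for j in range(dim)]
        # Squared distance of each feature to its own class mean.
        within += sum((h[j] - mean[j]) ** 2 for h in rows for j in range(dim))
        # Squared distance of the class mean to the global mean, weighted by class size.
        between += len(rows) * sum(
            (mean[j] - global_mean[j]) ** 2 for j in range(dim)
        )
    return within / between


# Toy features: in `collapsed`, every sample sits exactly on its class mean.
labels = [0, 0, 1, 1]
collapsed = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
spread = [[1.2, 0.1], [0.8, -0.1], [0.1, 1.2], [-0.1, 0.8]]

print(nc1_ratio(collapsed, labels))
print(nc1_ratio(spread, labels))
```

Under this proxy, features that have fully collapsed to their class means give a ratio of zero, and the ratio grows as samples scatter around those means.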

Paper Structure

This paper contains 39 sections, 25 theorems, 224 equations, 2 figures, 1 algorithm.

Key Result

Theorem 4.2

Under Assumption asm:mf-converge, for any ${\epsilon}_S$-stationary point $(\rho,W)$, we have the following characterization of the learned feature: where and the kernel $K_\rho(X,X) \in \mathbb{R}^{n \times n}$ induced by $\rho$ is As a consequence, if $W$ is non-singular, we have where

Figures (2)

  • Figure 1: Average training loss (blue), NC1 (orange) and gradient norm (green) during SGD training. We report the average for 4 independent experiments, as well as the confidence interval at 1 standard deviation.
  • Figure 2: Normalized balancedness (see \ref{eqn:norm_balanced}) as a function of the number of training epochs, with each color representing an independent experiment.

Theorems & Definitions (47)

  • Definition 4.1
  • Theorem 4.2
  • proof : Proof sketch
  • Lemma 4.3
  • Corollary 4.4
  • Lemma 4.5
  • Lemma 4.6
  • Lemma 4.7
  • Theorem 4.8
  • proof : Proof sketch
  • ...and 37 more