Can Kernel Methods Explain How the Data Affects Neural Collapse?
Vignesh Kothapalli, Tom Tirer
TL;DR
The paper investigates the NC1 component of Neural Collapse (the decrease in within-class feature variability) through kernel methods, demonstrating that the data-independent NTK does not produce more feature collapse than the NNGP on Gaussian data. It introduces a kernel-based NC1 metric and analyzes both the limiting kernels (NNGP and NTK) and data-aware GP kernels, obtained via Equations of State (EoS), that model finite-width feature learning. The results show that data-adaptive kernels can yield lower NC1 and align better with the NC1 of shallow NNs in some regimes, and that the choice of activation (ERF vs. ReLU) materially affects NC1; however, EoS-based methods may diverge in imbalanced or nonlinearly separable high-dimensional settings. Overall, the work highlights the potential of data-aware kernel models to advance NC analysis, while underscoring the limitations of the NTK and the need for further theoretical and numerical development.
Abstract
A vast body of literature has recently focused on the "Neural Collapse" (NC) phenomenon, which emerges when training neural network (NN) classifiers beyond the zero training error point. The core component of NC is the decrease in the within-class variability of the network's deepest features, dubbed NC1. The theoretical works that study NC are typically based on simplified unconstrained features models (UFMs) that mask any effect of the data on the extent of collapse. To address this limitation of UFMs, this paper explores the possibility of analyzing NC1 using kernels associated with shallow NNs. We begin by formulating an NC1 metric as a function of the kernel. Then, we specialize it to the NN Gaussian Process kernel (NNGP) and the Neural Tangent Kernel (NTK), associated with wide networks at initialization and during gradient-based training with a small learning rate, respectively. As a key result, we show that the NTK does not represent more collapsed features than the NNGP for Gaussian data of arbitrary dimensions. This showcases the limitations of data-independent kernels such as the NTK in approximating the NC behavior of NNs. As an alternative to the NTK, we then empirically explore a recently proposed data-aware Gaussian Process kernel, which generalizes the NNGP to model feature learning. We show that this kernel yields lower NC1 than the NNGP but may not follow the trends of the shallow NN. Our study demonstrates that adaptivity to data may allow kernel-based analysis of NC, though further advancements in this area are still needed. A useful byproduct of our study is showing, both theoretically and empirically, that the choice of nonlinear activation function affects NC1 (with ERF yielding lower values than ReLU). The code is available at: https://github.com/kvignesh1420/shallow_nc1
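To make the idea of "an NC1 metric as a function of the kernel" concrete, here is a minimal NumPy sketch. It assumes NC1 is measured as the trace ratio Tr(Σ_W)/Tr(Σ_B) of the within-class and between-class feature covariances; the paper's exact definition may differ. It also assumes the standard closed-form single-hidden-layer NNGP kernels for ReLU (arc-cosine) and ERF (arcsine) with unit-variance Gaussian weights and no bias. The function names `kernel_nc1`, `nngp_relu`, and `nngp_erf` are hypothetical and are not taken from the authors' repository.

```python
import numpy as np

def kernel_nc1(K, y):
    """Trace-ratio NC1 computed only from a Gram matrix K[i, j] = <phi_i, phi_j>.

    Assumed definition (not necessarily the paper's):
      Tr(Sigma_W) = (1/N) * sum_c sum_{i in c} ||phi_i - mu_c||^2
      Tr(Sigma_B) = (1/C) * sum_c ||mu_c - mu_G||^2
    """
    K = np.asarray(K, dtype=float)
    y = np.asarray(y)
    N = K.shape[0]
    classes = np.unique(y)
    global_sq = K.sum() / N**2                  # ||mu_G||^2
    tr_w, tr_b = 0.0, 0.0
    for c in classes:
        idx = np.flatnonzero(y == c)
        n_c = idx.size
        K_cc = K[np.ix_(idx, idx)]
        mean_sq = K_cc.sum() / n_c**2           # ||mu_c||^2
        cross = K[idx, :].sum() / (n_c * N)     # <mu_c, mu_G>
        tr_w += np.trace(K_cc) - n_c * mean_sq  # sum_i ||phi_i - mu_c||^2
        tr_b += mean_sq - 2.0 * cross + global_sq
    return (tr_w / N) / (tr_b / classes.size)

def nngp_relu(X):
    """E_w[relu(w.x) relu(w.x')] for w ~ N(0, I) (arc-cosine kernel). Rows of X must be nonzero."""
    G = X @ X.T
    norms = np.sqrt(np.diag(G))
    scale = np.outer(norms, norms)
    cos = np.clip(G / scale, -1.0, 1.0)
    theta = np.arccos(cos)
    return scale * (np.sin(theta) + (np.pi - theta) * cos) / (2.0 * np.pi)

def nngp_erf(X):
    """E_w[erf(w.x) erf(w.x')] for w ~ N(0, I) (arcsine kernel)."""
    G = X @ X.T
    d = 1.0 + 2.0 * np.diag(G)
    ratio = np.clip(2.0 * G / np.sqrt(np.outer(d, d)), -1.0, 1.0)
    return (2.0 / np.pi) * np.arcsin(ratio)

if __name__ == "__main__":
    # Toy data: two Gaussian classes in 8 dimensions (illustrative only).
    rng = np.random.default_rng(0)
    X = np.vstack([rng.normal(+1.0, 1.0, size=(50, 8)),
                   rng.normal(-1.0, 1.0, size=(50, 8))])
    y = np.repeat([0, 1], 50)
    print("NC1 (ReLU NNGP):", kernel_nc1(nngp_relu(X), y))
    print("NC1 (ERF NNGP): ", kernel_nc1(nngp_erf(X), y))
```

Because the metric depends on the features only through inner products, the same function can be applied to any positive semi-definite kernel matrix, including an NTK or a data-aware kernel, once that matrix has been computed by other means.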
