Can Kernel Methods Explain How the Data Affects Neural Collapse?

Vignesh Kothapalli, Tom Tirer

TL;DR

The paper studies NC1, the within-class variability collapse component of Neural Collapse, through kernel methods, and shows that the data-independent NTK does not represent more collapsed features than the NNGP on Gaussian data. It introduces a kernel-based NC1 metric and analyzes both the limiting kernels (NNGP and NTK) and data-aware GP kernels obtained via Equations of State (EoS), which model finite-width feature learning. The results show that data-adaptive kernels can yield lower NC1 than NNGP and align better with the NC1 of shallow NNs in some regimes, and that the choice of activation materially affects NC1 (with ERF yielding lower values than ReLU). However, EoS-based methods may diverge in imbalanced or nonlinearly separable high-dimensional settings. Overall, the work highlights the potential of data-aware kernel models to advance NC analysis, while underscoring the limitations of NTK and the need for further theoretical and numerical development.

Abstract

A vast amount of literature has recently focused on the "Neural Collapse" (NC) phenomenon, which emerges when training neural network (NN) classifiers beyond the zero training error point. The core component of NC is the decrease in the within-class variability of the network's deepest features, dubbed as NC1. The theoretical works that study NC are typically based on simplified unconstrained features models (UFMs) that mask any effect of the data on the extent of collapse. To address this limitation of UFMs, this paper explores the possibility of analyzing NC1 using kernels associated with shallow NNs. We begin by formulating an NC1 metric as a function of the kernel. Then, we specialize it to the NN Gaussian Process kernel (NNGP) and the Neural Tangent Kernel (NTK), associated with wide networks at initialization and during gradient-based training with a small learning rate, respectively. As a key result, we show that the NTK does not represent more collapsed features than the NNGP for Gaussian data of arbitrary dimensions. This showcases the limitations of data-independent kernels such as NTK in approximating the NC behavior of NNs. As an alternative to NTK, we then empirically explore a recently proposed data-aware Gaussian Process kernel, which generalizes NNGP to model feature learning. We show that this kernel yields lower NC1 than NNGP but may not follow the trends of the shallow NN. Our study demonstrates that adaptivity to data may allow kernel-based analysis of NC, though further advancements in this area are still needed. A nice byproduct of our study is showing both theoretically and empirically that the choice of nonlinear activation function affects NC1 (with ERF yielding lower values than ReLU). The code is available at: https://github.com/kvignesh1420/shallow_nc1

Paper Structure (42 sections, 7 theorems, 116 equations, 13 figures)

Key Result

Theorem 4.1

For any two data points $\mathbf{x}^{c,i}, \mathbf{x}^{c',j}$, let the inner-product of their associated features $\mathbf{h}^{c,i}, \mathbf{h}^{c',j}$ be given by a kernel $Q : \mathbb{R}^{d_0} \times \mathbb{R}^{d_0} \to \mathbb{R}$ as $Q(\mathbf{x}^{c,i},\mathbf{x}^{c',j})=\mathbf{h}^{c,i \top}\mathbf{h}^{c',j}$. ...
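
To make the idea of an NC1 metric "as a function of the kernel" concrete, here is a minimal NumPy sketch. It is not the theorem's exact statement. It assumes NC1 is the trace ratio $\mathrm{Tr}(\Sigma_W)/\mathrm{Tr}(\Sigma_B)$, with the within-class covariance averaged over all $N$ samples, the between-class covariance averaged over classes, and the global mean taken over all samples; the paper's normalization may differ. The function names `nc1_from_kernel` and `nc1_from_features` are hypothetical. The point of the sketch is that both traces only need the Gram matrix entries $Q_{ij} = \mathbf{h}_i^\top \mathbf{h}_j$, never the features themselves.

```python
import numpy as np


def nc1_from_kernel(Q, labels):
    """Trace-ratio NC1 computed from a Gram matrix only.

    Assumption: Q[i, j] = h_i^T h_j for features h_i, and
    NC1 = Tr(Sigma_W) / Tr(Sigma_B) with the normalization described above.
    """
    Q = np.asarray(Q, dtype=float)
    labels = np.asarray(labels)
    N = Q.shape[0]
    classes = np.unique(labels)

    global_mean_sq = Q.sum() / N**2  # ||mu_G||^2
    tr_within, tr_between = 0.0, 0.0
    for c in classes:
        idx = np.flatnonzero(labels == c)
        n_c = idx.size
        Q_cc = Q[np.ix_(idx, idx)]             # within-class block of Q
        class_mean_sq = Q_cc.sum() / n_c**2    # ||mu_c||^2
        cross = Q[idx, :].sum() / (n_c * N)    # mu_c^T mu_G
        # sum_i ||h_i - mu_c||^2 = sum_i Q_ii - n_c * ||mu_c||^2
        tr_within += np.trace(Q_cc) - n_c * class_mean_sq
        # ||mu_c - mu_G||^2 expanded into inner products
        tr_between += class_mean_sq - 2.0 * cross + global_mean_sq
    return (tr_within / N) / (tr_between / classes.size)


def nc1_from_features(H, labels):
    """Same quantity computed directly from features H (one row per sample)."""
    H = np.asarray(H, dtype=float)
    labels = np.asarray(labels)
    classes = np.unique(labels)
    mu_g = H.mean(axis=0)
    tr_within, tr_between = 0.0, 0.0
    for c in classes:
        H_c = H[labels == c]
        mu_c = H_c.mean(axis=0)
        tr_within += ((H_c - mu_c) ** 2).sum()
        tr_between += ((mu_c - mu_g) ** 2).sum()
    return (tr_within / H.shape[0]) / (tr_between / classes.size)
```

As a usage note, passing `Q = H @ H.T` to `nc1_from_kernel` should agree with `nc1_from_features(H, labels)` up to floating-point error, and the value is zero when all features in a class coincide.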

Figures (13)

  • Figure 1: Visualizing the kernel matrix $\mathbf{Q}$ for the limiting NNGP post-activation kernel function $Q_{GP-Erf}: \mathbb{R}^2 \times \mathbb{R}^2 \to \mathbb{R}$. The data is sampled from two Gaussian distributions in (a) balanced and (b) imbalanced fashion to illustrate the structure of the sub-matrices $\mathbf{Q}_{c,c'}, c,c' \in \{1,2\}.$
  • Figure 2: $\mathcal{N}\mathcal{C}_1(\mathbf{H})$ of (a) the post-activation NNGP kernel ($Q^{(1)}_{GP-Erf}$), (b) NTK ($\Theta^{(2)}_{NTK-Erf}$), (c) 2L-FCN with $d_1=500$ for Erf activation on dataset $\mathcal{D}_1(N, d_0)$ (as defined in the paper's equation for $\mathcal{D}_1$).
  • Figure 3: $\mathcal{N}\mathcal{C}_1(\mathbf{H})$ of (a) the post-activation NNGP kernel ($Q^{(1)}_{GP-ReLU}$), (b) NTK ($\Theta^{(2)}_{NTK-ReLU}$), (c) 2L-FCN with $d_1=500$ for ReLU activation on dataset $\mathcal{D}_1(N, d_0)$ (as defined in the paper's equation for $\mathcal{D}_1$).
  • Figure 4: $\mathcal{N}\mathcal{C}_1(\mathbf{H})$ of the $\mathbf{Q}^{(1)}$ kernel obtained by solving the EoS (a) $d_1=2000$ (b) $d_1=500$ on $\mathcal{D}_1(N, d_0)$.
  • Figure 5: $\mathcal{N}\mathcal{C}_1(\mathbf{H})$ of the limiting kernels, adaptive kernel (EoS) with final annealing factor $d_1=500$ and 2L-FCN with $d_1=500$ and Erf activation. The dimension $d_0$ on the x-axis is chosen from $\{8, 16, 32, 64, 128\}$. For a particular $N$, we sample the vectors $\mathbf{x}^{1,i} \sim \mathcal{N}( -2*\mathbf{1}_{d_0}, 4*\mathbf{I}_{d_0}), y^{1,i} = -1, i \in [N/2]$ for class $1$ and $\mathbf{x}^{2,j} \sim \mathcal{N}( 2*\mathbf{1}_{d_0}, 4*\mathbf{I}_{d_0}), y^{2, j} = 1, j \in [N/2]$ for class $2$.
  • ...and 8 more figures
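
The two-class Gaussian sampling described in the Figure 5 caption, together with a post-activation Erf NNGP kernel of the kind plotted in Figures 1 and 2, can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' code: the function names are hypothetical, the first-layer weights are assumed to be i.i.d. $\mathcal{N}(0, \sigma_w^2/d_0)$ with no bias term, and the closed form used is the standard arcsine expression for Gaussian expectations of products of Erf activations. The paper's kernel may use a different parametrization.

```python
import math

import numpy as np


def sample_two_gaussians(N, d0, seed=0):
    """Sample N/2 points per class as in the Figure 5 caption.

    Class 1: N(-2 * 1, 4 * I) with label -1; class 2: N(2 * 1, 4 * I) with label +1.
    A covariance of 4 * I corresponds to a per-coordinate standard deviation of 2.
    """
    rng = np.random.default_rng(seed)
    half = N // 2
    x1 = rng.normal(loc=-2.0, scale=2.0, size=(half, d0))
    x2 = rng.normal(loc=2.0, scale=2.0, size=(half, d0))
    X = np.vstack([x1, x2])
    y = np.concatenate([-np.ones(half), np.ones(half)])
    return X, y


def erf_nngp_kernel(X, sigma_w=1.0):
    """Closed-form E_w[erf(w^T x) erf(w^T x')] for w ~ N(0, sigma_w^2 / d0 * I).

    Assumption: no bias term; sigma_w is a hypothetical weight-scale parameter.
    """
    d0 = X.shape[1]
    S = (sigma_w**2 / d0) * (X @ X.T)  # covariance of the pre-activations
    diag = np.diag(S)
    denom = np.sqrt((1.0 + 2.0 * diag)[:, None] * (1.0 + 2.0 * diag)[None, :])
    return (2.0 / np.pi) * np.arcsin(np.clip(2.0 * S / denom, -1.0, 1.0))


def erf_random_feature_kernel(X, width, sigma_w=1.0, seed=0):
    """Finite-width Monte Carlo estimate of the same kernel at initialization."""
    rng = np.random.default_rng(seed)
    d0 = X.shape[1]
    W = rng.normal(scale=sigma_w / math.sqrt(d0), size=(width, d0))
    F = np.vectorize(math.erf)(W @ X.T)  # erf features, shape (width, N)
    return F.T @ F / width
```

For example, `X, y = sample_two_gaussians(128, 8)` followed by `Q = erf_nngp_kernel(X)` gives a $128 \times 128$ kernel matrix whose class blocks can be inspected as in Figure 1. As the width grows, `erf_random_feature_kernel` should approach the closed form.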

Theorems & Definitions (14)

  • Theorem 4.1
  • Theorem 5.1: ReLU Activation
  • Definition 6.1
  • Lemma C.1
  • proof
  • Lemma C.2
  • proof
  • Lemma C.3
  • proof
  • Lemma C.4
  • ...and 4 more