Towards understanding neural collapse in supervised contrastive learning with the information bottleneck method

Siwei Wang, Stephanie E Palmer

TL;DR

The paper investigates how neural collapse in supervised contrastive learning relates to generalization by framing the phenomenon as an information bottleneck (IB) problem. Leveraging linear identifiability between independently trained encoders, it applies a Gaussian information bottleneck (GIB) proxy via a Meta-Gaussian IB to show that classification information concentrates into a $K$-dimensional Gaussian representation, with class means forming a $K$-simplex ETF. This $K$-dimensional IB-optimal geometry emerges during training and aligns with improved generalization, and the same ECM structure appears in compressed representations even when using zero-shot transfer with ImageNet32. Overall, the work connects NC, linear identifiability, and optimal IB solutions to explain and predict generalization performance in supervised contrastive learning, suggesting a universal low-dimensional geometry for efficient information coding of class labels.

Abstract

Neural collapse describes the geometry of activation in the final layer of a deep neural network when it is trained beyond performance plateaus. Open questions include whether neural collapse leads to better generalization and, if so, why and how training beyond the plateau helps. We model neural collapse as an information bottleneck (IB) problem in order to investigate whether such a compact representation exists and discover its connection to generalization. We demonstrate that neural collapse leads to good generalization specifically when it approaches an optimal IB solution of the classification problem. Recent research has shown that two deep neural networks independently trained with the same contrastive loss objective are linearly identifiable, meaning that the resulting representations are equivalent up to a matrix transformation. We leverage linear identifiability to approximate an analytical solution of the IB problem. This approximation demonstrates that when class means exhibit $K$-simplex Equiangular Tight Frame (ETF) behavior (e.g., $K$=10 for CIFAR10 and $K$=100 for CIFAR100), they coincide with the critical phase transitions of the corresponding IB problem. The performance plateau occurs once the optimal solution for the IB problem includes all of these phase transitions. We also show that the resulting $K$-simplex ETF can be packed into a $K$-dimensional Gaussian distribution using supervised contrastive learning with a ResNet50 backbone. This geometry suggests that the $K$-simplex ETF learned by supervised contrastive learning approximates the optimal features for source coding. Hence, there is a direct correspondence between optimal IB solutions and generalization in contrastive learning.
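
The linear identifiability claim in the abstract can be checked numerically. If two representations are equal up to an invertible matrix, all of their canonical correlation (CCA) coefficients equal 1. The following is a minimal sketch on synthetic data, not the authors' pipeline: the helper name `cca_coefficients`, the sample sizes, and the random matrix `a` are illustrative assumptions, and the QR-based CCA computation is one standard way to obtain the coefficients.

```python
import numpy as np

def cca_coefficients(z1, z2):
    """Canonical correlations between two (n_samples, dim) representations.

    Hypothetical helper: uses orthonormal bases (QR) of the centered data
    and the singular values of their cross-product.
    """
    z1 = z1 - z1.mean(axis=0)
    z2 = z2 - z2.mean(axis=0)
    q1, _ = np.linalg.qr(z1)  # orthonormal basis of span(z1)
    q2, _ = np.linalg.qr(z2)  # orthonormal basis of span(z2)
    s = np.linalg.svd(q1.T @ q2, compute_uv=False)
    return np.clip(s, 0.0, 1.0)  # guard against rounding slightly above 1

rng = np.random.default_rng(0)
z1 = rng.normal(size=(500, 8))         # stand-in for encoder 1 features
a = rng.normal(size=(8, 8))            # assumed invertible (true with probability 1)
z2 = z1 @ a                            # encoder 2 = linear transform of encoder 1
unrelated = rng.normal(size=(500, 8))  # features with no shared structure

rho_linear = cca_coefficients(z1, z2)
rho_unrelated = cca_coefficients(z1, unrelated)
print("linearly related:", rho_linear.round(3))
print("unrelated:       ", rho_unrelated.round(3))
```

In this toy setting the linearly related pair gives coefficients at 1, while the unrelated pair gives much smaller values, which is the contrast a CCA-based identifiability measure relies on.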

Paper Structure (15 sections, 2 theorems, 4 equations, 4 figures, 3 tables)

This paper contains 15 sections, 2 theorems, 4 equations, 4 figures, 3 tables.

Key Result

Proposition 2.3

Consider learned representations $Z_1$ and $Z_2$ with a Gaussian covariance structure and arbitrary margins [rey2014, Rey2012] (see Supplementary Information for details), where $F(Z) = \{F_{Z_{1,i}}\}$ or $\{F_{Z_{2,i}}\}$ are the marginal distributions of $Z_1$, $Z_2$ and $C_G$ is a Gaussian copula parameterized by a correlation matrix $G$. The optimum of the minimization problem (eq:IB) is obtained for

Figures (4)

  • Figure 1: IB ties compression to better generalization. The IB information curve represents optimal classification across $H(Y)$. We hypothesize that neural collapse exists in the elbow of the IB curve, where the $K$-simplex ETF emerges after the IB optimal solution includes all critical phase transitions $\beta_{1,\cdots,K}$ at equivalent noise levels.
  • Figure 2: The emergence of neural collapse, shown via variability collapse (NC1), improves linear identifiability. We measure the linear identifiability between two learned representations using the average of their CCA coefficients [Raghu2017]. Using CCA coefficients to measure linear similarity was proposed in [Roeder2020]. The X-axis shows the variance of the $K$ leading CCA coefficients for $K$-class datasets (e.g., $K$=10 for CIFAR-10 and $K$=100 for CIFAR-100). The Y-axis is the metric for variability collapse. Each dot is the mean of the CCA coefficients and the error bar shows the respective standard deviation. Colors indicate epochs. We also shade the area with less than 2% training error (the terminal phase of training [Ekambaram2017, Mueller2019]). a) CIFAR-10; b) CIFAR-100.
  • Figure 3: Within the $K$-dimensional IB-optimal Gaussian distribution for both CIFAR10 and CIFAR100, the class means exhibit a $K$-simplex ETF comparable to the $K$-simplex ETF in the full 2048-D representation space. A) As training progresses, the $K$-dimensional IB-optimal representation retains most of the classification performance for CIFAR10. B) As training progresses, the standard deviation of the norms of all class means, i.e., $\mathrm{Std}_k(\|\mu_k-\mu_{all}\|_2/\mathrm{Avg}(\|\mu_k-\mu_{all}\|_2))$, gets smaller within the $K$-dimensional Gaussian distribution (solid line). The decrease is more pronounced than that observed in the full 2048-D representation space (dashed line). C) The standard deviation of the angles between class means, i.e., $\theta_{\mu_{c,c'}}$, for CIFAR10 gets smaller as training progresses. The difference between the $K$-dimensional IB-optimal Gaussian distribution (solid line) and the full 2048-D representation space (dashed line) is small, i.e., $5^\circ$. D) The cosine of the mean angle between class activations converges to $-1/(K-1)$ ($K$=10 for CIFAR10), i.e., $\cos{\theta_{\mu_{c,c'}}}\rightarrow -1/(K-1)$. Again, the $K$-dimensional IB-optimal Gaussian distribution behaves comparably to the full 2048-D representation space in placing different class means at pairwise angles close to $\cos^{-1}{(-1/(K-1))}$. E)-H) $K$-simplex ETF observations for CIFAR100.
  • Figure 4: A) Zero-shot transfer learning for CIFAR10 needs nearly 70 dimensions to enable classifiers to search for the nearest class means (NC4) and retain classification performance. B) Feature scaling for the $K$-dimensional Gaussian distribution that fits the corresponding $K$-simplex ETF optimal for CIFAR10. Most of the feature scalings are between 0.9 and 1.1 with respect to the mean. Similar scalings between features indicate that each IB dimension corresponds to similar phase transition coefficients and contributes nearly equally to encoding $H(Y)-\delta$. This is compatible with the geometry of the $K$-simplex ETF. C) In the zero-shot transfer learning scenario, we may find a $K'$-dimensional Gaussian distribution that retains most of the classification information for CIFAR10. While $K' > K$ ($K' = 70$ and $K=10$ for CIFAR10), all feature scalings are similar (also between 0.9 and 1.1), indicating that each IB dimension contributes nearly equally. Note that this suggests that models trained with different datasets characterize the same CIFAR10 with a similar geometry (i.e., the simplex ETF). Whether such geometry is a universal solution for representation learning is an interesting future direction.
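
The two geometric quantities tracked in Figure 3 (the normalized spread of class-mean norms and the pairwise cosines between centered class means) can be illustrated on an ideal simplex ETF. The sketch below uses the standard construction $\sqrt{K/(K-1)}\,(I_K - \tfrac{1}{K}\mathbf{1}\mathbf{1}^\top)$. The function names `simplex_etf` and `etf_metrics` are hypothetical, and the input is an exact ETF rather than class means from a trained network.

```python
import numpy as np

def simplex_etf(k):
    """Rows are K unit vectors with pairwise cosine -1/(K-1)."""
    return np.sqrt(k / (k - 1)) * (np.eye(k) - np.ones((k, k)) / k)

def etf_metrics(class_means):
    """Equinorm and equiangularity metrics for a (K, d) array of class means.

    Returns the standard deviation of the normalized centered norms
    (0 for a perfect ETF) and the off-diagonal pairwise cosines.
    """
    centered = class_means - class_means.mean(axis=0)  # subtract global mean
    norms = np.linalg.norm(centered, axis=1)
    equinorm = np.std(norms / norms.mean())
    unit = centered / norms[:, None]
    cos = unit @ unit.T
    off_diag = cos[~np.eye(len(cos), dtype=bool)]      # drop self-similarities
    return equinorm, off_diag

k = 10  # number of classes, as for CIFAR10
equinorm, cosines = etf_metrics(simplex_etf(k))
# Shifting every class mean by a constant leaves both metrics unchanged,
# because the global mean is removed first.
equinorm_shifted, cosines_shifted = etf_metrics(simplex_etf(k) + 3.0)
print("std of normalized norms:", equinorm)
print("mean pairwise cosine:", cosines.mean(), "target:", -1 / (k - 1))
```

For class means estimated from real features, both quantities would only approach these ideal values as training progresses.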

Theorems & Definitions (3)

  • Definition 2.1
  • Proposition 2.3: Optimality of Meta-Gaussian Information bottleneck
  • Theorem 2.4: Optimal solution for the Gaussian Information Bottleneck
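
The phase transitions $\beta_{1,\cdots,K}$ referred to in Figure 1 can be illustrated with the standard Gaussian IB result of Chechik et al. (2005): for jointly Gaussian $X$ and $Y$, each eigenvalue $\lambda_i < 1$ of $\Sigma_{x|y}\Sigma_x^{-1}$ gives a critical value $\beta_i^c = 1/(1-\lambda_i)$ at which a new dimension enters the optimal representation. That Theorem 2.4 restates this result is an assumption, since the theorem body is not reproduced here. The function name `gib_critical_betas` and the toy covariances below are illustrative only.

```python
import numpy as np

def gib_critical_betas(sigma_x, sigma_xy, sigma_y):
    """Critical beta values of the Gaussian IB, sorted ascending.

    Hypothetical helper: eigenvalues of Sigma_{x|y} Sigma_x^{-1} equal to 1
    carry no information about Y and are dropped.
    """
    sigma_x_given_y = sigma_x - sigma_xy @ np.linalg.solve(sigma_y, sigma_xy.T)
    m = sigma_x_given_y @ np.linalg.inv(sigma_x)
    lam = np.sort(np.linalg.eigvals(m).real)
    lam = lam[lam < 1.0 - 1e-12]  # keep informative directions only
    return 1.0 / (1.0 - lam)

# Toy model: Y ~ N(0, I_2); X = (Y_1 + n_1, Y_2 + n_2, n_3) with
# noise variances 0.25, 1.0 and 1.0, so the third coordinate is pure noise.
sigma_y = np.eye(2)
sigma_xy = np.array([[1.0, 0.0],
                     [0.0, 1.0],
                     [0.0, 0.0]])
sigma_x = np.diag([1.25, 2.0, 1.0])

betas = gib_critical_betas(sigma_x, sigma_xy, sigma_y)
print("critical betas:", betas)
```

In this toy model the less noisy coordinate enters the representation at a smaller $\beta$ than the noisier one. When all informative directions have the same eigenvalue, their critical values coincide, which is consistent with the near-equal feature scalings described in Figure 4.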