Understanding Dimensional Collapse in Contrastive Self-supervised Learning

Li Jing, Pascal Vincent, Yann LeCun, Yuandong Tian

TL;DR

The paper shows that dimensional collapse, in which embedding vectors span only a low-dimensional subspace, also occurs in contrastive self-supervised learning, and it identifies two driving mechanisms: strong augmentation and implicit regularization in deep networks. It provides a theoretical framework describing the gradient-flow dynamics and weight alignment that lead to low-rank representations. Motivated by these insights, it introduces DirectCLR, a contrastive method that applies the loss to a fixed subvector of the representation, directly optimizing the representation space without an explicit trainable projector. On ImageNet, DirectCLR outperforms SimCLR with a trainable linear projector, demonstrating the practicality of bypassing the explicit projector. These results offer a new perspective on SSL dynamics and suggest projector design as a crucial lever for controlling representation geometry.

Abstract

Self-supervised visual representation learning aims to learn useful representations without relying on human annotations. The joint embedding approach is based on maximizing the agreement between embedding vectors from different views of the same image. Various methods have been proposed to solve the collapsing problem, in which all embedding vectors collapse to a trivial constant solution. Among these methods, contrastive learning prevents collapse via negative sample pairs. It has been shown that non-contrastive methods suffer from a lesser collapse problem of a different nature: dimensional collapse, whereby the embedding vectors end up spanning a lower-dimensional subspace instead of the entire available embedding space. Here, we show that dimensional collapse also happens in contrastive learning. In this paper, we shed light on the dynamics at play in contrastive learning that lead to dimensional collapse. Inspired by our theory, we propose a novel contrastive learning method, called DirectCLR, which directly optimizes the representation space without relying on an explicit trainable projector. Experiments show that DirectCLR outperforms SimCLR with a trainable linear projector on ImageNet.
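
The idea of applying the contrastive loss to a fixed subvector of the representation, with no trainable projector, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the choice of the first `d0` coordinates as the subvector, the values of `d0` and `temperature`, the function names, and the simplified one-directional InfoNCE loss (negatives taken only from the other view) are all assumptions made for illustration.

```python
import numpy as np

def info_nce(z1, z2, temperature=0.1):
    """Simplified InfoNCE: row i of z1 and row i of z2 form a positive pair."""
    # L2-normalize so that dot products are cosine similarities.
    z1 = z1 / np.linalg.norm(z1, axis=1, keepdims=True)
    z2 = z2 / np.linalg.norm(z2, axis=1, keepdims=True)
    logits = z1 @ z2.T / temperature                      # (n, n), diagonal = positives
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_prob))

def directclr_style_loss(r1, r2, d0=4, temperature=0.1):
    """Contrastive loss on a fixed subvector of the representations.

    Assumption: the subvector is the first d0 coordinates; no projector is used.
    """
    return info_nce(r1[:, :d0], r2[:, :d0], temperature)

# Hypothetical representations of two augmented views of the same 8 images.
rng = np.random.default_rng(0)
r1 = rng.normal(size=(8, 16))
r2 = r1 + 0.05 * rng.normal(size=(8, 16))
loss = directclr_style_loss(r1, r2)
```

In this sketch the loss value depends only on the first `d0` coordinates. In a full model, the remaining coordinates would still be shaped by training through the shared backbone.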

Paper Structure

This paper contains 30 sections, 14 theorems, 51 equations, 12 figures, 2 tables.

Key Result

Lemma 1

The weight matrix in a linear contrastive self-supervised learning model evolves by:

$$\dot{W} = -G$$

where $G = \sum_i (\mathbf{g}_{\mathbf{z}_i}\mathbf{x}_i^T + \mathbf{g}_{\mathbf{z}_i'}\mathbf{x}_i'^T)$, and $\mathbf{g}_{\mathbf{z}_i}$ is the gradient on the embedding vector $\mathbf{z}_i$ (similarly $\mathbf{g}_{\mathbf{z}_i'}$).
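
For a linear model $\mathbf{z}_i = W\mathbf{x}_i$, the chain rule gives the gradient of the loss with respect to $W$ as exactly the matrix $G$ above, so gradient flow moves $W$ along $-G$. The following sketch checks this identity numerically. The loss used here (a dot-product InfoNCE on unnormalized embeddings), the toy dimensions, and the helper names are assumptions for illustration, not the paper's exact setup; the identity itself holds for any differentiable loss on the embeddings.

```python
import numpy as np

def contrastive_loss(Z, Zp):
    """Simplified InfoNCE on unnormalized embeddings; rows i of Z and Zp are positives."""
    logits = Z @ Zp.T
    logits = logits - logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.sum(np.diag(log_prob))

def numerical_grad(f, A, eps=1e-5):
    """Central finite-difference gradient of a scalar function f at array A."""
    grad = np.zeros_like(A)
    for idx in np.ndindex(*A.shape):
        step = np.zeros_like(A)
        step[idx] = eps
        grad[idx] = (f(A + step) - f(A - step)) / (2 * eps)
    return grad

rng = np.random.default_rng(0)
n, d_in, d_out = 5, 4, 3
X = rng.normal(size=(n, d_in))                 # inputs x_i (rows)
Xp = X + 0.1 * rng.normal(size=(n, d_in))      # augmented views x_i'
W = rng.normal(size=(d_out, d_in))

Z, Zp = X @ W.T, Xp @ W.T                      # z_i = W x_i, z_i' = W x_i'

# Gradients on the embedding vectors (row i holds g_{z_i} or g_{z_i'}).
G_z = numerical_grad(lambda A: contrastive_loss(A, Zp), Z)
G_zp = numerical_grad(lambda A: contrastive_loss(Z, A), Zp)

# G = sum_i (g_{z_i} x_i^T + g_{z_i'} x_i'^T), shape (d_out, d_in).
G = G_z.T @ X + G_zp.T @ Xp

# Direct gradient of the loss with respect to W, for comparison.
dL_dW = numerical_grad(lambda M: contrastive_loss(X @ M.T, Xp @ M.T), W)

max_abs_diff = np.max(np.abs(G - dL_dW))
```

Here `max_abs_diff` should be at the level of finite-difference error, confirming that a small step along $-G$ is a gradient-descent step on the loss.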

Figures (12)

  • Figure 1: Illustration of the collapsing problem. In complete collapse, the embedding vectors collapse to the same point. In dimensional collapse, the embedding vectors span only a lower-dimensional subspace.
  • Figure 2: Singular value spectrum of the embedding space. The embedding vectors are computed from a pretrained SimCLR model on the validation set of ImageNet. Each embedding vector has a dimension of 128. The spectrum shows the singular values of the covariance matrix of these embedding vectors, sorted and plotted on a logarithmic scale. A number of singular values drop to zero, indicating collapsed dimensions.
  • Figure 3: Weight matrix singular value spectrum for different augmentation amplitudes $k$. The setting is a single-layer linear toy model with each weight matrix of size 16x16, where the block has size 8x8. Strong augmentation results in vanishing singular values in the weight matrices.
  • Figure 4: Two-layer Linear Model
  • Figure 5: Visualization of the alignment matrix $A=V_2^TU_1$ after training. The setting is a 2-layer linear toy model with each weight matrix of size 16x16. The alignment matrix converges to an identity matrix.
  • ...and 7 more figures
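
The spectrum described for Figure 2 (singular values of the embedding covariance matrix, sorted and viewed on a log scale) can be computed along the following lines. This is a sketch on synthetic data rather than SimCLR embeddings: the embeddings are deliberately confined to a rank-10 subspace of a 16-dimensional space, and the function name, the sizes, and the relative threshold used to count non-collapsed dimensions are assumptions for illustration.

```python
import numpy as np

def covariance_spectrum(embeddings):
    """Singular values of the embedding covariance matrix, in descending order."""
    centered = embeddings - embeddings.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / embeddings.shape[0]
    # np.linalg.svd returns singular values sorted from largest to smallest.
    return np.linalg.svd(cov, compute_uv=False)

rng = np.random.default_rng(0)
n, d, rank = 1000, 16, 10
# Synthetic embeddings that span only a rank-10 subspace (dimensional collapse).
Z = rng.normal(size=(n, rank)) @ rng.normal(size=(rank, d))

spectrum = covariance_spectrum(Z)
log_spectrum = np.log(spectrum + 1e-30)   # small offset avoids log(0)

# Count dimensions whose singular value is non-negligible relative to the largest.
effective_rank = int(np.sum(spectrum > 1e-8 * spectrum[0]))
```

On this synthetic example the last six entries of `spectrum` are numerically zero, so `effective_rank` comes out as 10. The same function could be applied to a matrix of real embeddings (one row per image) to look for collapsed dimensions.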

Theorems & Definitions (23)

  • Lemma 1
  • Lemma 2
  • Theorem 1
  • Corollary 1: Dimensional Collapse Caused by Strong Augmentation
  • Lemma 3
  • Theorem 2: Weight matrices align
  • Theorem 3
  • Corollary 2: Dimensional Collapse Caused by Implicit Regularization
  • Proposition 1
  • Proposition 2
  • ...and 13 more