Preventing Dimensional Collapse in Self-Supervised Learning via Orthogonality Regularization

Junlin He, Jinxiao Du, Wei Ma

TL;DR

This work is the first to propose mitigating dimensional collapse by applying orthogonal regularization (OR) across the encoder, targeting both convolutional and linear layers during pretraining, thus safeguarding against the dimensional collapse of weight matrices, hidden features, and representations.

Abstract

Self-supervised learning (SSL) has advanced rapidly in recent years, approaching the performance of its supervised counterparts by extracting representations from unlabeled data. However, dimensional collapse, where a few large eigenvalues dominate the eigenspace, poses a significant obstacle for SSL. When dimensional collapse occurs in features (e.g., hidden features and representations), it prevents them from representing the full information of the data; when it occurs in weight matrices, their filters become mutually correlated and redundant, limiting their expressive power. Existing studies have predominantly concentrated on the dimensional collapse of representations, without examining whether addressing it is sufficient to prevent the dimensional collapse of weight matrices and hidden features. To this end, we propose, for the first time, a mitigation approach that applies orthogonal regularization (OR) across the encoder, targeting both convolutional and linear layers during pretraining. OR promotes orthogonality within weight matrices, thus safeguarding against the dimensional collapse of weight matrices, hidden features, and representations. Our empirical investigations demonstrate that OR significantly enhances the performance of SSL methods across diverse benchmarks, yielding consistent gains with both CNNs and Transformer-based architectures.
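
To make the idea of "promoting orthogonality within weight matrices" concrete, here is a minimal sketch of one common regularizer, the soft orthogonality penalty $\|W^TW - I\|_F^2$. The abstract does not say which form of OR the authors use, so this choice is an assumption, as are the function names, the pure-Python list-of-rows representation, and the convention of flattening a convolutional kernel so that each filter becomes one column.

```python
def gram_matrix(W):
    """Return W^T W for a matrix W given as a list of rows."""
    rows, cols = len(W), len(W[0])
    return [[sum(W[k][i] * W[k][j] for k in range(rows)) for j in range(cols)]
            for i in range(cols)]


def soft_orthogonality_penalty(W):
    """Squared Frobenius norm of (W^T W - I); zero iff the columns are orthonormal."""
    G = gram_matrix(W)
    n = len(G)
    return sum((G[i][j] - (1.0 if i == j else 0.0)) ** 2
               for i in range(n) for j in range(n))


def conv_kernel_to_matrix(kernel):
    """Flatten a kernel indexed as kernel[out][in][y][x] into a 2D matrix.

    Assumed convention (not from the paper): each filter is flattened into
    a vector and stored as one column, matching W in R^{input x output}.
    """
    flat_filters = [[v for channel in filt for row in channel for v in row]
                    for filt in kernel]
    return [list(col) for col in zip(*flat_filters)]  # transpose: filters as columns


# Orthonormal columns incur no penalty.
orthonormal = [[1.0, 0.0],
               [0.0, 1.0]]
penalty_orthonormal = soft_orthogonality_penalty(orthonormal)  # 0.0

# Two identical filters (a collapsed weight matrix) are penalized.
collapsed = [[1.0, 1.0],
             [0.0, 0.0]]
penalty_collapsed = soft_orthogonality_penalty(collapsed)  # 2.0

# A toy conv kernel: 2 filters, 1 input channel, 1x2 spatial extent.
toy_kernel = [[[[1.0, 0.0]]],
              [[[0.0, 1.0]]]]
penalty_conv = soft_orthogonality_penalty(conv_kernel_to_matrix(toy_kernel))  # 0.0
```

In a training loop, such a penalty would typically be summed over the encoder's layers and added to the SSL objective with a small weighting coefficient; the coefficient and the exact summation are likewise assumptions here rather than details taken from the paper.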

Paper Structure

This paper contains 24 sections, 1 theorem, 6 equations, 6 figures, 10 tables.

Key Result

Proposition 1

Consider a weight matrix $W \in \mathbb{R}^{input \times output}$ and an input $X \in \mathbb{R}^{N \times input}$ comprising $N$ samples, each of dimensionality $input$. Let $S = XW$, where $W^TW = I$, and denote by $\bar{X}$ and $\bar{S}$ the sample means of $X$ and $S$, respectively. The covariance mat

Figures (6)

  • Figure 1: Illustration of dimensional collapse in SSL. We use one augmented input $X_{aug1}$ as an example and assume that the encoder contains two basic blocks, each consisting of a linear operation (e.g., a linear layer or convolutional layer) and an activation function. Dimensional collapse can occur in the weight matrices ($W_1, W_2$), the hidden features, and the final representations. Existing methods act directly on representations and expect to affect hidden features and weight matrices indirectly, which has no theoretical guarantee; our method directly constrains weight matrices and indirectly influences hidden features and representations, which is supported by theoretical analysis.
  • Figure 2: Illustration of the general structure of joint-embedding SSL methods. Depending on the SSL method, the different augmented inputs pass through either a weight-sharing Encoder and Projector or independent Encoders and Projectors.
  • Figure 3: Eigenspectra of both weights and features within the encoder (ResNet18). The features are collected on the first batch of the test set (batch size 4,096). We pretrain BYOL on CIFAR-10 without OR, with feature whitening from VICREG, and with OR. The x-axis and y-axis are both log-scaled. A solid line indicates that all eigenvalues are positive, a dashed line indicates that some eigenvalues are non-positive, and the number of eigenvalues is given after the underscore.
  • Figure 4: Eigenspectra of both weights and features within the encoder (ResNet18). The features are collected on the first batch of the test set (batch size 4,096). We pretrain the original VICREG, VICREG without the projector, and VICREG with OR on CIFAR-10. The x-axis and y-axis are both log-scaled. A solid line indicates that all eigenvalues are positive, a dashed line indicates that some eigenvalues are non-positive, and the number of eigenvalues is given after the underscore.
  • Figure 5: The heatmap visualizes the absolute values of the correlation coefficients among the filters of the weight matrix (layer4). The biclustering plot visualizes the results of spectral biclustering. OR significantly reduces the correlation among filters, as shown by the heatmap, and removes the clustering patterns among filters, as shown by the biclustering.
  • ...and 1 more figure

Theorems & Definitions (2)

  • Definition 1: Normalized eigenvalues
  • Proposition 1
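
The setting of Proposition 1 can be unpacked with a short derivation that follows from its definitions alone. The statement of the proposition is truncated in this listing, so the sketch below does not attempt to reproduce its conclusion; it only records how the sample covariance of $S = XW$ relates to that of $X$. The symbols $C_X$ and $C_S$ for the sample covariances, the all-ones vector $\mathbf{1} \in \mathbb{R}^{N \times 1}$, and the $\tfrac{1}{N-1}$ normalization are notation assumed here, not taken from the paper; the identity holds equally with $\tfrac{1}{N}$.

$$
\begin{aligned}
\bar{S} &= \tfrac{1}{N}\mathbf{1}^T S = \tfrac{1}{N}\mathbf{1}^T X W = \bar{X} W, \\
C_S &= \tfrac{1}{N-1}\,(S - \mathbf{1}\bar{S})^T (S - \mathbf{1}\bar{S})
     = \tfrac{1}{N-1}\, W^T (X - \mathbf{1}\bar{X})^T (X - \mathbf{1}\bar{X})\, W
     = W^T C_X W.
\end{aligned}
$$

In the special case where $W$ is square, $W^TW = I$ makes $W$ orthogonal, so $C_S = W^{-1} C_X W$ is similar to $C_X$ and has the same eigenvalues. This is one way to see why constraining weights toward orthogonality can keep a linear layer from shrinking the eigenspectrum of the features it receives.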