Understanding Self-Supervised Learning via Gaussian Mixture Models

Parikshit Bansal, Ali Kavis, Sujay Sanghavi

TL;DR

This work analyzes self-supervised representation learning through the lens of Gaussian Mixture Models, defining augmentations as paired samples drawn from the same mixture component and evaluating how contrastive and non-contrastive losses recover informative subspaces. It proves that InfoNCE (and, under suitable conditions, SimSiam) can recover the Fisher subspace for shared-covariance GMMs, going beyond traditional spectral methods which may fail for non-spherical covariances. The analysis extends to multi-modal data (e.g., CLIPGMM), showing that cross-modal contrastive learning learns a subset of the Fisher-optimal subspaces for each modality, effectively filtering noise directions. Synthetic experiments corroborate the theory, demonstrating robustness to augmentation noise, dimensionality effects, and variance structure, and highlighting the superiority of augmentation-enabled self-supervision over purely spectral approaches. Overall, the paper provides principled, provable guarantees for why and when self-supervised contrastive objectives yield Fisher-discriminant subspaces in structured probabilistic models, with implications for linear representations and multi-modal learning.

Abstract

Self-supervised learning attempts to learn representations from unlabeled data; it does so via a loss function that encourages the embedding of a point to be close to that of its augmentations. This simple idea performs remarkably well, yet why it does so is not precisely understood theoretically. In this paper we analyze self-supervised learning in a natural context: dimensionality reduction in Gaussian Mixture Models. Crucially, we define an augmentation of a data point as being another independent draw from the same underlying mixture component. We show that vanilla contrastive learning (specifically, the InfoNCE loss) is able to find the optimal lower-dimensional subspace even when the Gaussians are not isotropic, something that vanilla spectral techniques cannot do. We also prove a similar result for "non-contrastive" self-supervised learning (i.e., SimSiam loss). We further extend our analyses to multi-modal contrastive learning algorithms (e.g., CLIP). In this setting we show that contrastive learning learns a subset of the Fisher-optimal subspace, effectively filtering out all the noise from the learnt representations. Finally, we corroborate our theoretical findings through synthetic data experiments.
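The data model described in the abstract can be illustrated with a small NumPy sketch. It is not the paper's code: the function names, the two-component "parallel pancakes" parameters, and the separation score are all assumptions chosen for illustration. It draws augmentation pairs as two independent samples from the same mixture component, then compares the top SVD direction of the data with the two-component Fisher direction $\boldsymbol{\Sigma}^{-1}(\boldsymbol{\mu}_1 - \boldsymbol{\mu}_0)$.

```python
import numpy as np


def sample_aed_pairs(n, means, cov, weights, rng):
    """Draw n pairs (x, x_aug): two independent draws from the same component."""
    z = rng.choice(len(weights), size=n, p=weights)  # latent component labels
    chol = np.linalg.cholesky(cov)                   # shared covariance factor
    d = means.shape[1]
    x = means[z] + rng.standard_normal((n, d)) @ chol.T
    x_aug = means[z] + rng.standard_normal((n, d)) @ chol.T
    return x, x_aug, z


def top_svd_direction(x):
    """Leading right singular vector of the centered data (spectral baseline)."""
    xc = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(xc, full_matrices=False)
    return vt[0]


def fisher_direction(means, cov):
    """Two-component Fisher direction, Sigma^{-1} (mu_1 - mu_0), normalized."""
    w = np.linalg.solve(cov, means[1] - means[0])
    return w / np.linalg.norm(w)


def separation(x, z, direction):
    """Gap between projected class means divided by the pooled projected std."""
    p = x @ direction
    gap = abs(p[z == 1].mean() - p[z == 0].mean())
    spread = np.sqrt(0.5 * (p[z == 0].var() + p[z == 1].var()))
    return gap / spread


# Hypothetical "parallel pancakes": means differ along axis 1,
# while the shared covariance has its large variance along axis 0.
rng = np.random.default_rng(0)
means = np.array([[0.0, -1.0], [0.0, 1.0]])
cov = np.diag([9.0, 0.25])
weights = np.array([0.5, 0.5])

x, x_aug, z = sample_aed_pairs(20000, means, cov, weights, rng)
v_svd = top_svd_direction(x)
v_f = fisher_direction(means, cov)

print("SVD direction:   ", np.round(v_svd, 3), "separation:", separation(x, z, v_svd))
print("Fisher direction:", np.round(v_f, 3), "separation:", separation(x, z, v_f))
```

In this toy setting the SVD direction follows the high-variance axis, where the two components overlap, while the Fisher direction follows the axis along which the means differ. The pairs `(x, x_aug)` are the kind of input a contrastive or non-contrastive objective would consume; the sketch does not train one.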

Paper Structure

This paper contains 53 sections, 5 theorems, 64 equations, 4 figures, 1 table.

Key Result

Lemma 4.2

Let $\{w_k,{\boldsymbol{\mu}}_k,{\boldsymbol{\Sigma}}\}_{k\in[K]}$ be a SharedGMM and $\Pr(z=k|{{\boldsymbol {x}}})$ be the posterior probability of ${{\boldsymbol {x}}}$ being drawn from the component $z$. Let $S_F$ be the mixture's Fisher subspace and ${{\boldsymbol {A}}}_F$ be a projection matrix

Figures (4)

  • Figure 1: (a) For spherical Gaussians, $S_{SVD}$ and $S_F$ overlap, and hence projection onto $S_{SVD}$ leads to a well-separated GMM. (b) (Parallel Pancakes) For general non-spherical Gaussians, large variance in some direction leads to $S_{SVD} \neq S_F$. Hence, projection onto $S_{SVD}$ leads to mode collapse. (c) (Shifted Parallel Pancakes) The mean subspace $S_{{\boldsymbol{\mu}}}$ does not always coincide with $S_F$.
  • Figure 2: We empirically validate our theoretical findings in four main settings (see Sec. \ref{sec:exp}). (a) Self-supervised learning is robust to noise in the Augmentation-enabled Distribution. (b) InfoNCE (and SimSiam) are invariant to variance orthogonal to the Fisher subspace. (c) InfoNCE loss is better than spectral methods for every projection dimension. (d) InfoNCE loss learns a good scaling within the Fisher subspace.
  • Figure 3: Caption
  • Figure 4: We further extend the results presented in Fig. \ref{fig:mi_plots} with orthonormalized InfoNCE and SimSiam mappings.

Theorems & Definitions (19)

  • Definition 3.1: SharedGMM
  • Definition 3.2: Fisher Discriminant
  • Remark 3.3
  • Definition 4.1: Fisher Subspace
  • Lemma 4.2
  • Definition 4.3
  • Definition 5.1: Augmentation-enabled Distribution (AeD)
  • Theorem 5.2
  • Remark 5.3
  • Definition 6.1
  • ...and 9 more