Understanding Self-Supervised Learning via Gaussian Mixture Models
Parikshit Bansal, Ali Kavis, Sujay Sanghavi
TL;DR
This work analyzes self-supervised representation learning through the lens of Gaussian Mixture Models, defining augmentations as paired samples drawn from the same mixture component and evaluating how contrastive and non-contrastive losses recover informative subspaces. It proves that InfoNCE (and, under suitable conditions, SimSiam) can recover the Fisher subspace for shared-covariance GMMs, going beyond traditional spectral methods, which may fail for non-spherical covariances. The analysis extends to multi-modal data (e.g., CLIPGMM), showing that cross-modal contrastive learning learns a subset of the Fisher-optimal subspace for each modality, effectively filtering out noise directions. Synthetic experiments corroborate the theory, demonstrating robustness to augmentation noise, dimensionality effects, and variance structure, and highlighting the advantage of augmentation-enabled self-supervision over purely spectral approaches. Overall, the paper provides provable guarantees for why and when self-supervised contrastive objectives yield Fisher-discriminant subspaces in structured probabilistic models, with implications for linear representations and multi-modal learning.
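For orientation, the two objects named in the summary can be written down in standard textbook form. The notation here is an assumption made for illustration and is not taken from the paper: mixture weights $w_k$, component means $\mu_k$, shared covariance $\Sigma$, a linear encoder $A$, a positive pair $(x, \hat{x})$ drawn from the same component, and $m$ negatives $x_i^-$. The paper's exact definitions and normalization may differ.

```latex
% Fisher (linear discriminant) subspace of a shared-covariance GMM
\bar{\mu} = \sum_{k} w_k \mu_k, \qquad
S_F = \operatorname{span}\bigl\{ \Sigma^{-1}(\mu_k - \bar{\mu}) \bigr\}_{k}

% One common form of the InfoNCE loss for a linear encoder A
\mathcal{L}_{\mathrm{InfoNCE}}(A) =
  -\,\mathbb{E}\left[
    \log \frac{\exp\bigl(\langle A x, A \hat{x} \rangle\bigr)}
              {\exp\bigl(\langle A x, A \hat{x} \rangle\bigr)
               + \sum_{i=1}^{m} \exp\bigl(\langle A x, A x_i^{-} \rangle\bigr)}
  \right]
```

When $\Sigma$ is a multiple of the identity, $S_F$ coincides with the span of the centered means; for non-spherical $\Sigma$ the two generally differ, and this is the regime the summary refers to.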
Abstract
Self-supervised learning attempts to learn representations from unlabeled data; it does so via a loss function that encourages the embedding of a point to be close to that of its augmentations. This simple idea performs remarkably well, yet why it does so is not precisely understood theoretically. In this paper we analyze self-supervised learning in a natural context: dimensionality reduction in Gaussian Mixture Models. Crucially, we define an augmentation of a data point as being another independent draw from the same underlying mixture component. We show that vanilla contrastive learning (specifically, the InfoNCE loss) is able to find the optimal lower-dimensional subspace even when the Gaussians are not isotropic, something that vanilla spectral techniques cannot do. We also prove a similar result for "non-contrastive" self-supervised learning (i.e., the SimSiam loss). We further extend our analyses to multi-modal contrastive learning algorithms (e.g., CLIP). In this setting we show that contrastive learning learns a subset of the Fisher-optimal subspace, effectively filtering out all the noise from the learnt representations. Finally, we corroborate our theoretical findings through synthetic data experiments.
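The augmentation model described above can be made concrete with a small NumPy sketch. It is not the paper's InfoNCE or SimSiam procedure. It is a plain moment calculation, added here only to illustrate why same-component pairs carry information that unpaired samples lack. The two-component mixture, the uniform weights, the specific means and covariance, and all function names are assumptions chosen for the example. Because the two draws in a pair have independent noise, their cross-covariance estimates the between-component covariance, from which a Fisher direction can be formed; the top PCA direction of the unpaired data instead follows the high-variance noise axis.

```python
import numpy as np

def sample_pairs(means, cov, n, rng):
    """Draw n pairs (x, x_aug); both members of a pair share a mixture component."""
    k = rng.integers(len(means), size=n)  # uniform mixture weights (assumption)
    chol = np.linalg.cholesky(cov)
    dim = means.shape[1]
    x = means[k] + rng.standard_normal((n, dim)) @ chol.T
    x_aug = means[k] + rng.standard_normal((n, dim)) @ chol.T  # independent redraw
    return x, x_aug

def unit(v):
    return v / np.linalg.norm(v)

def top_pca_direction(x):
    """Leading principal direction of the (unpaired) data."""
    xc = x - x.mean(axis=0)
    _, _, vt = np.linalg.svd(xc, full_matrices=False)
    return vt[0]

def fisher_direction_from_pairs(x, x_aug):
    """Moment-based estimate of a 1-D Fisher direction from augmentation pairs."""
    m = 0.5 * (x.mean(axis=0) + x_aug.mean(axis=0))
    xc, ac = x - m, x_aug - m
    n = len(x)
    cross = xc.T @ ac / n
    between = 0.5 * (cross + cross.T)            # noise is independent, so it averages out
    total = 0.5 * (xc.T @ xc + ac.T @ ac) / n
    within = total - between                     # estimate of the shared covariance
    _, vecs = np.linalg.eigh(between)            # eigenvalues in ascending order
    return unit(np.linalg.solve(within, vecs[:, -1]))

rng = np.random.default_rng(0)
means = np.array([[1.0, 0.0], [-1.0, 0.0]])      # components differ along axis 0
cov = np.array([[1.0, 0.0], [0.0, 25.0]])        # non-spherical: large noise on axis 1
x, x_aug = sample_pairs(means, cov, 20000, rng)

fisher_true = unit(np.linalg.solve(cov, means[0] - means[1]))
pca_alignment = abs(top_pca_direction(x) @ fisher_true)
pair_alignment = abs(fisher_direction_from_pairs(x, x_aug) @ fisher_true)
print(f"|cos| between PCA and Fisher direction:        {pca_alignment:.3f}")
print(f"|cos| between pair-based and Fisher direction: {pair_alignment:.3f}")
```

In this toy setting the PCA direction is nearly orthogonal to the Fisher direction, while the pair-based estimate is closely aligned with it. The sketch covers only a two-component, one-dimensional case and says nothing about how a gradient-trained contrastive objective behaves.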
