Can Generative Models Improve Self-Supervised Representation Learning?
Sana Ayromlou, Vahid Reza Khazaie, Fereshteh Forghani, Arash Afkanpour
TL;DR
This paper addresses the limited diversity of traditional SSL augmentations by introducing instance-conditioned generative augmentations that preserve semantic content while expanding visual variation. The approach integrates conditional generators (Stable Diffusion and ICGAN) with existing joint-embedding SSL methods (e.g., SimCLR, BYOL, MoCo, SimSiam, Barlow Twins), using samples generated offline. It yields consistent improvements in downstream linear-probing accuracy on ImageNet and other datasets, with gains of up to 10% Top-1 accuracy. A dissimilarity analysis (CKA, OPD) shows that the learned representations differ from those of CLIP, indicating that the augmented SSL space does not merely replicate pretrained encodings. The work highlights practical benefits of synthetic data for SSL and outlines future directions, such as co-training the generator with the SSL model and addressing ethical concerns about biases in generative models.
Abstract
The rapid advancement of self-supervised representation learning has highlighted its potential to leverage unlabeled data for learning rich visual representations. However, existing techniques, particularly those that employ different augmentations of the same image, often rely on a limited set of simple transformations that cannot fully capture real-world variation. This constrains the diversity and quality of samples, which leads to sub-optimal representations. In this paper, we introduce a framework that enriches the self-supervised learning (SSL) paradigm by using generative models to produce semantically consistent image augmentations. By directly conditioning generative models on a source image, our method generates diverse augmentations while maintaining the semantics of the source image, thus offering a richer set of data for SSL. Our extensive experiments on various joint-embedding SSL techniques demonstrate that our framework significantly enhances the quality of learned visual representations, improving Top-1 accuracy in downstream tasks by up to 10%. This research demonstrates that incorporating generative models into the joint-embedding SSL workflow opens new avenues for exploring the potential of synthetic data and paves the way for more robust and versatile representation learning techniques.
