StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners
Yonglong Tian, Lijie Fan, Phillip Isola, Huiwen Chang, Dilip Krishnan
TL;DR
This work investigates learning visual representations from synthetic images generated by a text-to-image model, Stable Diffusion, and introduces StableRep, a multi-positive contrastive objective. It shows that self-supervised learning (SSL) on synthetic images can match or exceed SSL on real images, and that adding language supervision further improves transfer: StableRep trained on 20M synthetic images outperforms CLIP trained on 50M real images. The key contributions are (i) showing that synthetic data is viable when the classifier-free guidance scale is set properly, (ii) introducing a multi-positive loss that treats the multiple images generated from one caption as positives for each other, and (iii) demonstrating strong linear-transfer and few-shot performance across diverse datasets, with favorable comparisons to SimCLR and CLIP trained on the same text prompts and their corresponding real images. The findings suggest a promising direction for reducing real-data requirements in representation learning, while highlighting limitations such as semantic mismatch between prompts and generated images, slow image generation, and biases inherited from the generative model.
Abstract
We investigate the potential of learning visual representations using synthetic images generated by text-to-image models. This is a natural question in light of the excellent performance of such models in generating high-quality images. We specifically consider Stable Diffusion, one of the leading open-source text-to-image models. We show that (1) when the generative model is configured with a proper classifier-free guidance scale, training self-supervised methods on synthetic images can match or beat the real-image counterpart; and (2) by treating the multiple images generated from the same text prompt as positives for each other, we develop a multi-positive contrastive learning method, which we call StableRep. With solely synthetic images, the representations learned by StableRep surpass the performance of representations learned by SimCLR and CLIP using the same set of text prompts and corresponding real images, on large-scale datasets. When we further add language supervision, StableRep trained with 20M synthetic images achieves better accuracy than CLIP trained with 50M real images.
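The idea in (2), treating images generated from the same prompt as positives for each other, can be illustrated with a small NumPy sketch. The abstract does not state the loss formula, so the form below is an assumption: a cross-entropy between a softmax over pairwise cosine similarities and a target distribution that is uniform over the other images sharing the anchor's caption. The function name, the `caption_ids` argument, and the temperature of 0.1 are all hypothetical choices made for illustration, not details taken from the paper.

```python
import numpy as np

def multi_positive_contrastive_loss(embeddings, caption_ids, temperature=0.1):
    """Illustrative multi-positive contrastive loss (assumed form, not the paper's code).

    embeddings:  (n, d) array, one row per generated image.
    caption_ids: (n,) array; images with the same id came from the same prompt.
    """
    z = np.asarray(embeddings, dtype=np.float64)
    ids = np.asarray(caption_ids)
    n = z.shape[0]

    # Cosine similarity between every pair of images, scaled by the temperature.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    logits = z @ z.T / temperature

    # An image is never its own candidate: remove the diagonal from the softmax.
    self_mask = np.eye(n, dtype=bool)
    logits = np.where(self_mask, -np.inf, logits)

    # Numerically stable log-softmax over the remaining candidates in each row.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_q = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    log_q = np.where(self_mask, 0.0, log_q)  # diagonal carries zero target mass

    # Target distribution: uniform over the other images from the same caption.
    positives = (ids[:, None] == ids[None, :]) & ~self_mask
    counts = positives.sum(axis=1)
    if np.any(counts == 0):
        raise ValueError("every caption needs at least two images in the batch")
    p = positives / counts[:, None]

    # Cross-entropy H(p, q) per anchor, averaged over the batch.
    return float(-(p * log_q).sum(axis=1).mean())
```

With one image per caption there are no positives, which is why the sketch raises an error in that case; the objective only makes sense when several images are generated per prompt.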
