StableRep: Synthetic Images from Text-to-Image Models Make Strong Visual Representation Learners

Yonglong Tian, Lijie Fan, Phillip Isola, Huiwen Chang, Dilip Krishnan

TL;DR

This work investigates learning visual representations from synthetic images generated by a text-to-image model, Stable Diffusion, and introduces StableRep, a multi-positive contrastive method that treats images generated from the same text prompt as positives for each other. It shows that self-supervised learning (SSL) on synthetic images can match or exceed SSL on the corresponding real images, and that adding language supervision improves results further: StableRep trained with 20M synthetic images achieves better accuracy than CLIP trained with 50M real images. The key contributions are (i) showing that synthetic data is viable when the classifier-free guidance scale is chosen properly, (ii) introducing a multi-positive contrastive loss that leverages multiple images per caption, and (iii) demonstrating strong linear-probe transfer and few-shot performance across diverse datasets, with favorable comparisons to CLIP trained on the same text prompts and corresponding real images. The findings suggest a promising direction for reducing real-data requirements in representation learning, while highlighting limitations such as semantic mismatch between prompts and generated images, slow image generation, and biases inherent to generative models.

Abstract

We investigate the potential of learning visual representations using synthetic images generated by text-to-image models. This is a natural question in light of the excellent performance of such models in generating high-quality images. Specifically, we consider Stable Diffusion, one of the leading open-source text-to-image models. We show that (1) when the generative model is configured with a proper classifier-free guidance scale, training self-supervised methods on synthetic images can match or beat the real-image counterpart; and (2) by treating the multiple images generated from the same text prompt as positives for each other, we develop a multi-positive contrastive learning method, which we call StableRep. With solely synthetic images, the representations learned by StableRep surpass the performance of representations learned by SimCLR and CLIP using the same set of text prompts and corresponding real images, on large-scale datasets. When we further add language supervision, StableRep trained with 20M synthetic images achieves better accuracy than CLIP trained with 50M real images.
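
The multi-positive idea in point (2) can be illustrated with a minimal sketch. It assumes the loss is a cross-entropy between a target distribution that is uniform over an anchor's positives (other images from the same caption) and a softmax over similarities to all other images in the batch. The function name, the pure-Python implementation, and the default temperature are illustrative assumptions, not the authors' code.

```python
import math


def multi_positive_contrastive_loss(embeddings, caption_ids, temperature=0.1):
    """Mean cross-entropy between target and predicted match distributions.

    embeddings  : list of equal-length float vectors, one per image
    caption_ids : list of ints; images sharing an id came from the same caption
    temperature : softmax temperature (the default here is an assumption)
    """
    # L2-normalize so that dot products are cosine similarities.
    normed = []
    for vec in embeddings:
        norm = math.sqrt(sum(x * x for x in vec))
        normed.append([x / norm for x in vec])

    n = len(normed)
    total, counted = 0.0, 0
    for a in range(n):
        # Candidates are every other image in the batch; the anchor is excluded.
        others = [i for i in range(n) if i != a]
        positives = [i for i in others if caption_ids[i] == caption_ids[a]]
        if not positives:
            continue  # this anchor has no positive in the batch

        logits = [
            sum(x * y for x, y in zip(normed[a], normed[i])) / temperature
            for i in others
        ]
        # Numerically stable log-softmax over the candidates.
        top = max(logits)
        log_z = top + math.log(sum(math.exp(l - top) for l in logits))
        log_q = {i: l - log_z for i, l in zip(others, logits)}

        # The target p is uniform over positives, so
        # H(p, q) = -(1 / |P|) * sum over positives of log q_i.
        total += -sum(log_q[i] for i in positives) / len(positives)
        counted += 1

    if counted == 0:
        raise ValueError("no anchor has a positive; need >= 2 images per caption")
    return total / counted
```

In this sketch the loss is small when images from the same caption have similar embeddings and large when they do not, which is the behavior the multi-positive objective is meant to encourage.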

Paper Structure

This paper contains 27 sections, 4 equations, 9 figures, 19 tables, 1 algorithm.

Figures (9)

  • Figure 1: Left: traditional visual representation learning relies on a dataset of real images to train an image embedding function. Right: we view generative models as datasets that allow us to sample images from the data distribution. In our study, we leverage text-to-image models (Stable Diffusion) and treat multiple images synthesized from the same prompt as positives for contrastive representation learning.
  • Figure 2: Performance of linear probes on ImageNet as a function of the guidance scale of Stable Diffusion generation. Left: using SimCLR as pre-training; Right: using MAE as pre-training. In both cases, pre-training on synthetic images generated by Stable Diffusion with a guidance scale between 6 and 8 gives a significant boost over training only on real images. We used the CC3M dataset for these experiments.
  • Figure 3: Training self-supervised methods on synthetic images can be better than, or on par with, training on real images of the same sample size. Left: CC3M dataset; Right: CC12M dataset.
  • Figure 4: We compare our pipeline (C) to those of (A) SimCLR and (B) CLIP. In SimCLR, the real image is augmented to give two views, which are contrasted against each other through the same encoder. For CLIP, a real image, augmented (usually more weakly than for SimCLR), and its corresponding real caption are passed into the image and text encoders, followed by a contrastive loss. In our pipeline, each real caption is passed into Stable Diffusion (SD) to generate a number of synthetic images. These synthetic images are then augmented as in SimCLR and treated as positives for each other in a multi-positive contrastive loss.
  • Figure 5: ImageNet zero-shot accuracy for different Stable Diffusion guidance scales $w$, using CLIP as pre-training.
  • ...and 4 more figures
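
Figures 2 and 5 sweep the classifier-free guidance scale $w$. As a rough illustration of what that knob does, the sketch below uses the convention common in Stable Diffusion style samplers, where the guided noise prediction is the unconditional prediction plus $w$ times the difference between the conditional and unconditional predictions. The page above does not spell out the formula, so this convention and the function name `guided_noise` are assumptions.

```python
def guided_noise(eps_cond, eps_uncond, w):
    """Blend conditional and unconditional noise predictions with scale w.

    eps_cond   : noise predicted with the text prompt (list of floats)
    eps_uncond : noise predicted with an empty prompt (list of floats)
    w          : guidance scale; w = 1 returns eps_cond unchanged, and larger
                 values push the sample further toward the prompt
    """
    # Assumed convention: eps_uncond + w * (eps_cond - eps_uncond).
    return [u + w * (c - u) for c, u in zip(eps_cond, eps_uncond)]
```

Under this convention, $w = 0$ ignores the prompt entirely and $w = 1$ applies no extra guidance. Larger values such as the 6 to 8 range mentioned in Figure 2 typically trade sample diversity for closer adherence to the prompt.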