Learning Vision from Models Rivals Learning Vision from Data
Yonglong Tian, Lijie Fan, Kaifeng Chen, Dina Katabi, Dilip Krishnan, Phillip Isola
TL;DR
SynCLR shows that fully synthetic data, consisting of LLM-written captions and diffusion-generated images, can yield competitive visual representations without any real data. By defining visual classes at the caption level and combining multi-positive contrastive learning with masked image modeling, it scales to hundreds of millions of captions and transfers well to ImageNet linear evaluation, fine-grained classification tasks, and ADE20k semantic segmentation. The approach matches or surpasses several real-data baselines while offering the scalability and controllability of generative models, and it generalizes better than some comparable methods to unseen concepts. The work positions learning from models as a practical, scalable alternative to real-data collection, with clear avenues for improvement in caption quality, higher-resolution pretraining, and larger architectures.
Abstract
We introduce SynCLR, a novel approach for learning visual representations exclusively from synthetic images and synthetic captions, without any real data. We synthesize a large dataset of image captions using LLMs, then use an off-the-shelf text-to-image model to generate multiple images corresponding to each synthetic caption. We perform visual representation learning on these synthetic images via contrastive learning, treating images sharing the same caption as positive pairs. The resulting representations transfer well to many downstream tasks, competing favorably with other general-purpose visual representation learners such as CLIP and DINO v2 in image classification tasks. Furthermore, in dense prediction tasks such as semantic segmentation, SynCLR outperforms previous self-supervised methods by a significant margin, e.g., improving over MAE and iBOT by 6.2 and 4.3 mIoU on ADE20k for ViT-B/16.
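To make the "images sharing the same caption are positives" idea concrete, here is a minimal sketch of a multi-positive contrastive loss in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the function names, the temperature value of 0.1, and the toy 2-D embeddings are all hypothetical, and the masked image modeling objective mentioned in the TL;DR is omitted. For each anchor image, the target distribution is uniform over the other images generated from the same caption, and the loss is the cross-entropy between that target and a softmax over cosine similarities.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def multi_positive_contrastive_loss(embeddings, caption_ids, temperature=0.1):
    """Average cross-entropy between a uniform target over same-caption
    images and the softmax over similarities to all other images.

    embeddings:  list of equal-length float lists, one per image
    caption_ids: list of hashable ids; equal ids mean "same caption"
    temperature: softmax temperature (0.1 is an assumed value)
    """
    z = [l2_normalize(v) for v in embeddings]
    n = len(z)
    total, counted = 0.0, 0
    for i in range(n):
        # Candidates are all images except the anchor itself.
        others = [j for j in range(n) if j != i]
        logits = {
            j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
            for j in others
        }
        positives = [j for j in others if caption_ids[j] == caption_ids[i]]
        if not positives:
            continue  # an anchor with no same-caption partner contributes nothing
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits.values()))
        # Uniform target over positives: mean negative log-probability.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        counted += 1
    if counted == 0:
        raise ValueError("no anchor has a same-caption positive")
    return total / counted


# Toy batch: two captions, two generated images each (hypothetical values).
toy_embeddings = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
aligned_loss = multi_positive_contrastive_loss(toy_embeddings, [0, 0, 1, 1])
shuffled_loss = multi_positive_contrastive_loss(toy_embeddings, [0, 1, 0, 1])
```

In this toy batch the loss is lower when the caption ids agree with the embedding clusters (`aligned_loss`) than when they are mismatched (`shuffled_loss`), which is the signal that pulls same-caption images together during training.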
