Learning Vision from Models Rivals Learning Vision from Data

Yonglong Tian; Lijie Fan; Kaifeng Chen; Dina Katabi; Dilip Krishnan; Phillip Isola

TL;DR

SynCLR demonstrates that fully synthetic data, consisting of LLM-produced captions and diffusion-generated images, can yield competitive visual representations without any real data. By defining visual classes at the caption level and combining multi-positive contrastive learning with masked image modeling, it scales to hundreds of millions of captions and transfers well on ImageNet linear evaluation, fine-grained classification tasks, and ADE20k semantic segmentation. The approach matches or surpasses several real-data baselines while offering the scalability and controllability of generative models, and it generalizes better than some peers on unseen concepts. The work presents learning from models as a practical, scalable alternative to real-data collection, with room for improvement in caption quality, higher-resolution pretraining, and larger architectures.

Abstract

We introduce SynCLR, a novel approach for learning visual representations exclusively from synthetic images and synthetic captions, without any real data. We synthesize a large dataset of image captions using LLMs, then use an off-the-shelf text-to-image model to generate multiple images corresponding to each synthetic caption. We perform visual representation learning on these synthetic images via contrastive learning, treating images sharing the same caption as positive pairs. The resulting representations transfer well to many downstream tasks, competing favorably with other general-purpose visual representation learners such as CLIP and DINO v2 in image classification tasks. Furthermore, in dense prediction tasks such as semantic segmentation, SynCLR outperforms previous self-supervised methods by a significant margin, e.g., improving over MAE and iBOT by 6.2 and 4.3 mIoU on ADE20k for ViT-B/16.
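The idea of treating images that share a caption as positives can be illustrated with a small multi-positive contrastive loss. The following is a minimal NumPy sketch and not the authors' implementation: the function name, the temperature value, and the choice of a uniform target distribution over same-caption images are assumptions made for illustration.

```python
import numpy as np


def multi_positive_contrastive_loss(embeddings, caption_ids, temperature=0.1):
    """Cross-entropy between a uniform target over same-caption images and a
    softmax over similarities to all other images in the batch.

    Assumptions (not taken from the source): cosine similarity, a temperature
    of 0.1, and a uniform target over positives.
    """
    z = np.asarray(embeddings, dtype=np.float64)
    ids = np.asarray(caption_ids)
    n = z.shape[0]

    # L2-normalize so that dot products are cosine similarities.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    logits = z @ z.T / temperature

    # An image is never compared with itself.
    self_mask = np.eye(n, dtype=bool)
    logits = np.where(self_mask, -np.inf, logits)

    # Numerically stable log-softmax over the non-self entries of each row.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_q = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    log_q = np.where(self_mask, 0.0, log_q)  # diagonal carries zero target mass

    # Positives: other images generated from the same caption.
    positives = (ids[:, None] == ids[None, :]) & ~self_mask
    counts = positives.sum(axis=1)
    if np.any(counts == 0):
        raise ValueError("every image needs at least one same-caption partner")
    target = positives / counts[:, None]

    return float(-(target * log_q).sum(axis=1).mean())
```

With several images per caption in a batch, each anchor has more than one positive, which is what distinguishes this objective from a single-positive loss of the SimCLR kind.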

Paper Structure (20 sections, 3 equations, 7 figures, 15 tables)

This paper contains 20 sections, 3 equations, 7 figures, 15 tables.

Introduction
Related Works
Approach
Synthesizing captions
Synthesizing Images
Representation Learning
Implementation
Experiment
Study different components
Scaling up
Further analysis
Discussions and Conclusion
Concept Sampling
Implementation Details
Pre-training
...and 5 more sections

Figures (7)

Figure 1: Three paradigms for visual representation learning. Top row: traditional methods, such as CLIP, learn only from real data. Middle row: recent methods, such as StableRep, learn from real text and generated images. Bottom row: our method, SynCLR, learns from synthetic text and synthetic images, and rivals the linear transfer performance of CLIP on ImageNet despite not directly observing any real data.
Figure 2: Different learning objectives treat classification granularity differently. These images are generated by two prompts: "a golden retriever, wearing sunglasses and a beach hat, rides a bike" and "a cute golden retriever sits in a house made of sushi". SimCLR treats each image as its own class, while supervised cross-entropy treats them all as the same "golden retriever" class. The former does not consider shared semantics between images, and the latter is coarse-grained and ignores actions or relationships between subjects and background. Our approach, SynCLR, defines visual classes by sentences.
Figure 3: In-context caption generation using Llama-2. We randomly sample three in-context examples for each inference run.
Figure 4: Random examples of synthetic captions and images generated in our SynCLR pipeline. Each caption comes with 4 images.
Figure 5: PCA visualization. Following DINO v2, we compute a PCA over the patches of the images from the same set and colorize them by their first 3 components. Compared to DINO v2, SynCLR produces more accurate maps for cars (e.g., zoom in to see the two bars on the roof of the first car, and the three side windows of the third car) and airplanes (e.g., the boundaries), while being slightly worse for dogs (e.g., heads). We use ViT-L/14 for both methods. Images are resized to 336x448 resolution before being fed into the networks, yielding 24x32 visualization grids.
...and 2 more figures
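
The grid size quoted in the Figure 5 caption follows from the patch size: a ViT-L/14 splits a 336x448 image into 336/14 = 24 by 448/14 = 32 patches. The sketch below shows that arithmetic and one plausible way to turn patch features into a 3-channel PCA map. It is an illustration under assumptions: the helper names are hypothetical, the features are random placeholders rather than model outputs, and the PCA is fit on a single image instead of a set of images.

```python
import numpy as np


def patch_grid(height, width, patch_size=14):
    """Number of patch rows and columns for an image of the given size."""
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of the patch size")
    return height // patch_size, width // patch_size


def pca_to_rgb(patch_features, grid_hw, n_components=3):
    """Project patch features onto their top principal components and rescale
    each component to [0, 1] so it can be displayed as a color channel."""
    x = np.asarray(patch_features, dtype=np.float64)
    rows, cols = grid_hw
    if x.shape[0] != rows * cols:
        raise ValueError("expected one feature vector per patch")

    # Center the features, then take principal directions from the SVD.
    x = x - x.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(x, full_matrices=False)
    proj = x @ vt[:n_components].T

    # Min-max normalize each component independently.
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    rgb = (proj - lo) / np.maximum(hi - lo, 1e-12)
    return rgb.reshape(rows, cols, n_components)


# Placeholder features standing in for ViT patch embeddings (hypothetical).
grid = patch_grid(336, 448, patch_size=14)
features = np.random.default_rng(0).normal(size=(grid[0] * grid[1], 64))
rgb_map = pca_to_rgb(features, grid)
```

In practice the features would come from the patch tokens of the pre-trained encoder, and fitting the PCA on patches pooled across several images keeps the colors comparable between them.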
