The Unmet Promise of Synthetic Training Images: Using Retrieved Real Images Performs Better

Scott Geng, Cheng-Yu Hsieh, Vivek Ramanujan, Matthew Wallingford, Chun-Liang Li, Pang Wei Koh, Ranjay Krishna

TL;DR

The paper asks whether synthetic training images provide information beyond what is available from targeted real images retrieved directly from the generator's upstream data. It compares finetuning a pretrained CLIP model on targeted synthetic data from Stable Diffusion 1.5 (trained on LAION-2B) against finetuning on targeted real images retrieved from LAION-2B, across five visual benchmarks. The main finding is that retrieved real data consistently matches or outperforms synthetic data at equivalent scales. Synthetic data yields limited gains and sometimes harms performance, which the analysis attributes in part to generator artifacts and distortions of task-relevant details. The authors argue that a simple, strong retrieval baseline should be considered when evaluating synthetic-data methods. They also discuss directions for improving synthesis, such as focusing on missing composites or constraining data sources, and practical scenarios where retrieval is not feasible.

Abstract

Generative text-to-image models enable us to synthesize unlimited amounts of images in a controllable manner, spurring many recent efforts to train vision models with synthetic data. However, every synthetic image ultimately originates from the upstream data used to train the generator. Does the intermediate generator provide additional information over directly training on relevant parts of the upstream data? Grounding this question in the setting of image classification, we compare finetuning on task-relevant, targeted synthetic data generated by Stable Diffusion -- a generative model trained on the LAION-2B dataset -- against finetuning on targeted real images retrieved directly from LAION-2B. We show that while synthetic data can benefit some downstream tasks, it is universally matched or outperformed by real data from the simple retrieval baseline. Our analysis suggests that this underperformance is partially due to generator artifacts and inaccurate task-relevant visual details in the synthetic images. Overall, we argue that targeted retrieval is a critical baseline to consider when training with synthetic data -- a baseline that current methods do not yet surpass. We release code, data, and models at https://github.com/scottgeng00/unmet-promise.
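
The abstract does not spell out how targeted images are retrieved, so the following is only a minimal sketch of one plausible baseline: keep the upstream image-text pairs whose caption mentions a target class name. The `ImageTextPair` type, the `retrieve_targeted` function, and the `max_per_class` parameter are hypothetical names introduced for illustration, and plain substring matching is an assumption rather than the authors' confirmed procedure.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class ImageTextPair:
    """One upstream sample (hypothetical schema)."""
    url: str      # where the image lives
    caption: str  # text paired with the image


def retrieve_targeted(
    pairs: Iterable[ImageTextPair],
    class_names: List[str],
    max_per_class: Optional[int] = None,
) -> Dict[str, List[ImageTextPair]]:
    """Group upstream pairs by the target class name found in their caption.

    Assumption: case-insensitive substring matching stands in for whatever
    retrieval rule is actually used.
    """
    targeted: Dict[str, List[ImageTextPair]] = {name: [] for name in class_names}
    for pair in pairs:
        caption = pair.caption.lower()
        for name in class_names:
            if name.lower() in caption:
                bucket = targeted[name]
                # Optionally cap the number of retrieved images per class.
                if max_per_class is None or len(bucket) < max_per_class:
                    bucket.append(pair)
    return targeted
```

A caption can match several class names, in which case the pair is added to each matching class. The resulting per-class lists would then serve as the targeted real dataset for finetuning.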

Paper Structure (47 sections, 4 equations, 10 figures, 1 table)

Figures (10)

  • Figure 1: Given an upstream dataset of general real image-text pairs, we aim to curate a targeted dataset to train a learner on some target task. We can either (1) retrieve targeted real images directly from the upstream dataset, or we can (2) first train an intermediate generative model and then synthesize targeted synthetic images. By comparing these two approaches, our paper seeks to measure what value training on generated synthetic data adds.
  • Figure 2: We adapt a pretrained CLIP image encoder (dashed purple line) to different downstream image classification tasks, using either (a) targeted synthetic data (orange triangles) generated from a Stable Diffusion model trained on LAION-2B or using (b) targeted real data (blue circles) directly retrieved from LAION-2B. We measure performance via downstream zero-shot (ZS) and linear probing (LP) accuracy, aggregating results over at least 3 seeds (error bars indicate $\pm 1$ standard deviation). Overall, while adapting CLIP with targeted synthetic data can sometimes improve performance over an off-the-shelf model, synthetic data is universally outperformed or matched by targeted real data. This gap persists even when we scale the sample size of the synthetic adaptation dataset beyond the maximum amount of (finite) targeted real data considered (gray shaded regions).
  • Figure 3: We visualize retrieved real images and synthetic images from our targeted adaptation datasets for FGVC-Aircraft (top two rows) and ImageNet-1K (bottom two rows), alongside ground truth images (left column) for reference. Compared to retrieved images, synthetic images often (1) contain generator artifacts (e.g., the blur on the edges of the "Cessna 172", the eyes and mouth of the "Tabby Cat") and also (2) distort class-relevant visual content, such as the engine configuration of a true "Airbus A320" (i.e., exactly one engine per wing) and the entire visual appearance of a "Flute". We hypothesize that both factors contribute to synthetic training data's underperformance versus real training data.
  • Figure 4: We use Stable Diffusion to synthetically perturb real images according to a noise strength parameter $\gamma \in [0,1]$, where larger $\gamma$ increases the severity of generator-specific artifacts added by the perturbation. When $\gamma \geq 0.6$, the introduced artifacts can be strong enough to damage task-relevant visual details for fine-grained tasks like FGVC-Aircraft (e.g., the airplane's engine and rear wheels). For broad tasks like ImageNet, artifacts have a lesser impact on class-relevant details; the "Tabby Cat" is recognizable as a cat even after perturbing with high $\gamma$.
  • Figure 5: We finetune a pretrained CLIP model (dashed purple line) on retrieved real images that are synthetically perturbed (green circles) with Stable Diffusion to introduce generator artifacts. The perturbation strength is controlled by a parameter $\gamma \in [0,1]$ where larger $\gamma$ introduces stronger artifacts; within the gray-shaded region, the artifacts are strong enough to damage class-relevant details. Our results suggest that generator artifacts do contribute to synthetic data's underperformance: any artifact level causes performance to drop below training on retrieved images (dashed blue line). Moreover, differences in visual content between synthetic and retrieved images also matter; even with relatively strong perturbations ($\gamma=0.5$), training on artifact-afflicted perturbed images that retain the semantic content of retrieved images outperforms training on synthetic images (dashed orange line).
  • ...and 5 more figures
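
The captions for Figures 4 and 5 describe the perturbation only through its noise strength parameter $\gamma$. One common way to realize such a parameter, assumed here and not confirmed by the captions, is image-to-image diffusion: noise the real image up to step $t = \mathrm{round}(\gamma T)$, then let the generator denoise from there. The sketch below covers only the noising half, on a flat list of pixel values rather than Stable Diffusion's latents, with an illustrative linear $\beta$ schedule. The names `alpha_bar_schedule`, `noise_to_strength`, and `signal_fraction`, and all default values, are hypothetical.

```python
import math
import random
from typing import List, Sequence, Tuple


def alpha_bar_schedule(
    num_steps: int = 1000, beta_start: float = 1e-4, beta_end: float = 0.02
) -> List[float]:
    """Cumulative product of (1 - beta_t) for a linear beta schedule.

    Assumption: the linear schedule and its endpoints are illustrative only.
    """
    schedule, running = [], 1.0
    for i in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * i / (num_steps - 1)
        running *= 1.0 - beta
        schedule.append(running)
    return schedule


def signal_fraction(gamma: float, alpha_bar: Sequence[float]) -> float:
    """Scale applied to the original image at the step implied by gamma."""
    t = round(gamma * len(alpha_bar))
    return 1.0 if t == 0 else math.sqrt(alpha_bar[t - 1])


def noise_to_strength(
    pixels: Sequence[float],
    gamma: float,
    alpha_bar: Sequence[float],
    rng: random.Random,
) -> Tuple[List[float], int]:
    """Forward-diffuse `pixels` to the timestep implied by gamma in [0, 1].

    Returns the noised values and the timestep t. A real pipeline would then
    run the generator's denoiser from step t back to 0 (not shown here).
    """
    if not 0.0 <= gamma <= 1.0:
        raise ValueError("gamma must lie in [0, 1]")
    t = round(gamma * len(alpha_bar))
    if t == 0:
        return list(pixels), 0  # gamma = 0 leaves the image untouched
    keep = math.sqrt(alpha_bar[t - 1])         # weight on the original image
    noise = math.sqrt(1.0 - alpha_bar[t - 1])  # weight on Gaussian noise
    return [keep * x + noise * rng.gauss(0.0, 1.0) for x in pixels], t
```

Under these assumptions, a larger $\gamma$ leaves a smaller share of the original signal before denoising begins, which is consistent with the captions' statement that larger $\gamma$ introduces stronger artifacts.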