The Unmet Promise of Synthetic Training Images: Using Retrieved Real Images Performs Better
Scott Geng, Cheng-Yu Hsieh, Vivek Ramanujan, Matthew Wallingford, Chun-Liang Li, Pang Wei Koh, Ranjay Krishna
TL;DR
The paper asks whether synthetic training images provide information beyond what targeted real images retrieved from the generator's upstream data already offer. It compares finetuning a pretrained CLIP model on targeted synthetic data from Stable Diffusion 1.5 (trained on LAION-2B) with finetuning on targeted real images retrieved from LAION-2B, across five visual benchmarks. The main finding is that retrieved real data consistently matches or outperforms synthetic data at equivalent data scales; synthetic data yields limited gains and sometimes hurts performance, in part because of generator artifacts and distorted task-relevant details. The authors argue that a simple, strong retrieval baseline should be included when evaluating synthetic-data methods. They also discuss directions for improving synthesis, such as focusing on missing composites or constraining data sources, and practical scenarios where retrieval is not feasible.
Abstract
Generative text-to-image models enable us to synthesize unlimited amounts of images in a controllable manner, spurring many recent efforts to train vision models with synthetic data. However, every synthetic image ultimately originates from the upstream data used to train the generator. Does the intermediate generator provide additional information over directly training on relevant parts of the upstream data? Grounding this question in the setting of image classification, we compare finetuning on task-relevant, targeted synthetic data generated by Stable Diffusion -- a generative model trained on the LAION-2B dataset -- against finetuning on targeted real images retrieved directly from LAION-2B. We show that while synthetic data can benefit some downstream tasks, it is universally matched or outperformed by real data from the simple retrieval baseline. Our analysis suggests that this underperformance is partially due to generator artifacts and inaccurate task-relevant visual details in the synthetic images. Overall, we argue that targeted retrieval is a critical baseline to consider when training with synthetic data -- a baseline that current methods do not yet surpass. We release code, data, and models at https://github.com/scottgeng00/unmet-promise.
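
To make the retrieval baseline concrete, here is a minimal sketch of one way targeted retrieval could be set up. The abstract does not specify the retrieval mechanism, so this sketch assumes embedding-based nearest-neighbor search: each class has a query vector, each image in the upstream pool has an embedding, and the k most similar images are kept per class. The names `cosine` and `retrieve_targeted`, and the 2-D toy vectors, are hypothetical and are not taken from the released code.

```python
import math
from typing import Dict, List, Sequence, Tuple


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def retrieve_targeted(
    class_queries: Dict[str, Sequence[float]],
    pool: List[Tuple[str, Sequence[float]]],
    k: int,
) -> Dict[str, List[str]]:
    """For each class, return the ids of the k pool images most similar to its query."""
    selected: Dict[str, List[str]] = {}
    for name, query in class_queries.items():
        # Rank the whole pool by similarity to this class query, best first.
        ranked = sorted(pool, key=lambda item: cosine(query, item[1]), reverse=True)
        selected[name] = [image_id for image_id, _ in ranked[:k]]
    return selected


# Toy 2-D "embeddings" (assumption): real ones would come from an image-text encoder.
toy_pool = [
    ("img_a", [1.0, 0.0]),
    ("img_b", [0.9, 0.1]),
    ("img_c", [0.0, 1.0]),
    ("img_d", [0.1, 0.9]),
]
toy_queries = {"class_x": [1.0, 0.0], "class_y": [0.0, 1.0]}
toy_result = retrieve_targeted(toy_queries, toy_pool, k=2)
```

In this toy setup, `toy_result` maps `class_x` to `img_a` and `img_b`, and `class_y` to `img_c` and `img_d`. A full sort over the pool is only workable at toy scale; a pool the size of LAION-2B would need an approximate nearest-neighbor index instead.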
