Diversify, Don't Fine-Tune: Scaling Up Visual Recognition Training with Synthetic Images
Zhuoran Yu, Chenchen Zhu, Sean Culatana, Raghuraman Krishnamoorthi, Fanyi Xiao, Yong Jae Lee
TL;DR
The paper asks whether visual recognition training can be scaled up with synthetic data without fine-tuning the diffusion model. It introduces a pipeline that uses LLMs and CLIP to resolve class name ambiguity, contextualized and stylized diversification to broaden the prompts used for generation, and domain-adaptive training with auxiliary batch normalization to bridge the gap between real and synthetic data distributions. Empirical results show robust in-domain and out-of-domain improvements across CNNs and vision transformers, gains that continue as synthetic data is scaled up to 6x the size of the real dataset, and strong performance in low-data and long-tail settings. The work demonstrates that synthetic data, when generated from diverse, semantically aligned prompts and paired with a suitable training strategy, can substantially improve large-scale recognition and generalization without expensive fine-tuning of the generative model.
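To make the prompt diversification idea concrete, here is a minimal sketch of how a class name, a disambiguating description, and lists of contexts and styles might be combined into text-to-image prompts. The function name `build_prompts`, the prompt templates, and the example strings are assumptions for illustration only. In the paper the descriptions, contexts, and styles are produced by an LLM, and the authors' actual templates may differ.

```python
def build_prompts(class_name, definition, contexts, styles):
    """Compose prompts from a class name plus contexts and styles (hypothetical templates)."""
    # Attach a short definition so an ambiguous name (e.g. "crane") is pinned down.
    subject = f"{class_name}, {definition}"
    # Contextualized prompts: place the subject in different scenes.
    contextual = [f"a photo of {subject}, {ctx}" for ctx in contexts]
    # Stylized prompts: render the subject in different visual styles.
    stylized = [f"a {style} of {subject}" for style in styles]
    return contextual + stylized


# Hand-written placeholder lists standing in for LLM output.
prompts = build_prompts(
    "crane",
    "a large wading bird",
    contexts=["standing in a marsh", "flying over a lake"],
    styles=["watercolor painting", "pencil sketch"],
)
for p in prompts:
    print(p)
```

Each resulting string would then be passed to an off-the-shelf text-to-image model. That step is omitted here.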
Abstract
Recent advances in generative deep learning have enabled the creation of high-quality synthetic images in text-to-image generation. Prior work shows that fine-tuning a pretrained diffusion model on ImageNet and generating synthetic training images from the fine-tuned model can enhance an ImageNet classifier's performance. However, performance degrades as synthetic images outnumber real ones. In this paper, we explore whether generative fine-tuning is essential for this improvement and whether it is possible to further scale up training using more synthetic data. We present a new framework leveraging off-the-shelf generative models to generate synthetic training images, addressing multiple challenges: class name ambiguity, lack of diversity in naive prompts, and domain shifts. Specifically, we leverage large language models (LLMs) and CLIP to resolve class name ambiguity. To diversify images, we propose contextualized diversification (CD) and stylized diversification (SD) methods, also prompted by LLMs. Finally, to mitigate domain shifts, we leverage domain adaptation techniques with auxiliary batch normalization for synthetic images. Our framework consistently enhances recognition model performance with more synthetic data, up to 6x the original ImageNet size, showcasing the potential of synthetic data for improved recognition models and strong out-of-domain generalization.
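The auxiliary batch normalization idea can be illustrated with a small sketch: the layer keeps one set of normalization statistics for real images and a separate set for synthetic images, while everything else in the network would be shared. The class name `DualBatchNorm1d`, the single-feature simplification, and the `domain` argument are assumptions made for this example. It omits the learnable scale and shift and uses the biased variance throughout, so it is not the authors' implementation.

```python
import math


class DualBatchNorm1d:
    """Batch norm over one scalar feature with separate statistics per domain (sketch)."""

    def __init__(self, momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        # One [running_mean, running_var] pair per domain.
        self.stats = {"real": [0.0, 1.0], "synthetic": [0.0, 1.0]}

    def forward(self, batch, domain, training=True):
        stats = self.stats[domain]
        if training:
            mean = sum(batch) / len(batch)
            # Biased variance, used here for both normalization and running stats.
            var = sum((x - mean) ** 2 for x in batch) / len(batch)
            # Update only the statistics of the domain this batch came from.
            stats[0] = (1 - self.momentum) * stats[0] + self.momentum * mean
            stats[1] = (1 - self.momentum) * stats[1] + self.momentum * var
        else:
            mean, var = stats
        return [(x - mean) / math.sqrt(var + self.eps) for x in batch]


bn = DualBatchNorm1d()
bn.forward([1.0, 3.0], domain="synthetic")  # updates the synthetic statistics only
print(bn.stats["real"], bn.stats["synthetic"])
```

Routing each batch by domain keeps the statistics of synthetic images from contaminating those estimated on real images, which is the intuition behind using a separate normalization branch to reduce domain shift.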
