Diversify, Don't Fine-Tune: Scaling Up Visual Recognition Training with Synthetic Images

Zhuoran Yu, Chenchen Zhu, Sean Culatana, Raghuraman Krishnamoorthi, Fanyi Xiao, Yong Jae Lee

TL;DR

The paper asks whether visual recognition training can be scaled up with synthetic data without fine-tuning a diffusion model. It introduces a pipeline that uses LLMs and CLIP to resolve label ambiguity, contextual and style diversification to broaden synthetic prompts, and domain-adaptive training with auxiliary batch normalization to bridge the real and synthetic data distributions. Empirical results show in-domain and out-of-domain improvements across CNNs and vision transformers, accuracy gains that keep growing as synthetic data scales up to 6x the size of the real dataset, and strong performance in low-data and long-tail settings. The work demonstrates that synthetic data, when generated with diverse, semantically aligned prompts and used with a suitable training strategy, can substantially improve large-scale recognition systems and their generalization without expensive generative fine-tuning.

Abstract

Recent advances in generative deep learning have enabled the creation of high-quality synthetic images in text-to-image generation. Prior work shows that fine-tuning a pretrained diffusion model on ImageNet and generating synthetic training images from the fine-tuned model can enhance an ImageNet classifier's performance. However, performance degrades as synthetic images outnumber real ones. In this paper, we explore whether generative fine-tuning is essential for this improvement and whether it is possible to further scale up training using more synthetic data. We present a new framework leveraging off-the-shelf generative models to generate synthetic training images, addressing multiple challenges: class name ambiguity, lack of diversity in naive prompts, and domain shifts. Specifically, we leverage large language models (LLMs) and CLIP to resolve class name ambiguity. To diversify images, we propose contextualized diversification (CD) and stylized diversification (SD) methods, also prompted by LLMs. Finally, to mitigate domain shifts, we leverage domain adaptation techniques with auxiliary batch normalization for synthetic images. Our framework consistently enhances recognition model performance with more synthetic data, up to 6x the original ImageNet size, showcasing the potential of synthetic data for improved recognition models and strong out-of-domain generalization.
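The idea of auxiliary batch normalization mentioned in the abstract can be illustrated with a minimal sketch: real and synthetic batches are normalized with separate statistics, so that a shift in the synthetic distribution does not contaminate the statistics used for real images. The class name `DomainBatchNorm`, the domain keys, and the toy batches are hypothetical, learnable scale and shift parameters are omitted, and this is not the authors' implementation.

```python
import math


class DomainBatchNorm:
    """Per-feature batch norm with separate statistics per domain.

    Illustrative only: the names and the pure-Python layout are assumptions,
    and the learnable affine parameters of a real BN layer are left out.
    """

    def __init__(self, num_features, domains=("real", "synthetic"),
                 momentum=0.1, eps=1e-5):
        self.momentum = momentum
        self.eps = eps
        # One set of running statistics per domain.
        self.running_mean = {d: [0.0] * num_features for d in domains}
        self.running_var = {d: [1.0] * num_features for d in domains}

    def forward(self, batch, domain, training=True):
        n = len(batch)
        num_features = len(batch[0])
        out = [[0.0] * num_features for _ in range(n)]
        for j in range(num_features):
            col = [row[j] for row in batch]
            if training:
                # Use batch statistics and update this domain's running stats.
                mean = sum(col) / n
                var = sum((x - mean) ** 2 for x in col) / n
                m = self.momentum
                self.running_mean[domain][j] = (
                    (1 - m) * self.running_mean[domain][j] + m * mean)
                self.running_var[domain][j] = (
                    (1 - m) * self.running_var[domain][j] + m * var)
            else:
                # At evaluation time, use the stored statistics of the domain.
                mean = self.running_mean[domain][j]
                var = self.running_var[domain][j]
            for i, x in enumerate(col):
                out[i][j] = (x - mean) / math.sqrt(var + self.eps)
        return out


# Toy one-feature batches with very different scales (made-up numbers).
bn = DomainBatchNorm(num_features=1)
real_out = bn.forward([[0.0], [2.0]], "real")
syn_out = bn.forward([[10.0], [14.0]], "synthetic")
print(real_out, syn_out)
print(bn.running_mean)
```

Both toy batches end up normalized to roughly the same range even though their raw scales differ, and each domain keeps its own running mean. In a setup like this, evaluation on real images would presumably use the `"real"` statistics; that routing is an assumption here, not something stated in the abstract.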

Paper Structure

This paper contains 18 sections, 1 equation, 8 figures, 9 tables.

Figures (8)

  • Figure 1: Prompt augmentation for unambiguous and diversified synthetic data generation. We improve synthetic data generation by augmenting prompts from two perspectives: 1) we resolve the ambiguity for class names to avoid generating images with incorrect semantics for the target class (e.g., row A vs. B), and 2) we diversify the prompts used to generate synthetic images both in terms of their contexts (row C) and styles (row D). We automatically achieve both augmentations with LLMs.
  • Figure 2: Our synthetic data generation pipeline. (a) Generating synthetic data with naive prompts can lead to incorrect semantics for classes with ambiguous names (e.g., the bird vs. the machine for "crane"). (b) Our Label Ambiguity Resolution (LAR) procedure resolves ambiguity in labels while preserving similar semantics in the generated images. (c) Our diversification procedure includes contextual diversification (CD) and style diversification (SD): we prompt an LLM to produce contextualized descriptions of images featuring class $c$ ("crane" in this example) that combine different aspects (indicated by different colors in the figure): foreground objects, background objects, lighting conditions, camera angles, and different styles.
  • Figure 3: Label Ambiguity Resolution. Given multiple meanings of class name $c$ ("crane" in this example), we use CLIP to compute the average similarity between each meaning and the training examples of class $c$. We then use this averaged image-text similarity as the metric for selecting the correct meaning of $c$. In this example, "a large long-legged bird" is selected and used as additional context in the rest of our pipeline.
  • Figure 4: Top-1 ImageNet Classification Accuracy vs. Synthetic Data Size. In contrast to the findings of previous work (Azizi et al., 2023; blue line), our method is able to scale up recognition training, with consistent accuracy improvement as the number of unique synthetic samples increases.
  • Figure 5: Qualitative comparison between our diversified synthetic images and synthetic images from Azizi et al. (2023). Since Azizi et al. (2023) do not share their fine-tuned model and synthetic data, we use the qualitative results provided in their manuscript for analysis. Synthetic images from Azizi et al. (2023) lack diversity in both foreground and background, whereas our diversified images show diverse foreground objects in different postures and camera angles, and different background environments. The diverse semantic information keeps the recognition model from overfitting to specific details of the synthetic data and prevents performance degradation when scaling up synthetic images.
  • ...and 3 more figures
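
The selection rule described in the Figure 3 caption can be sketched as follows: for each candidate meaning, average its similarity to the training images of the class, then keep the meaning with the highest average. The toy 3-dimensional vectors below are made-up stand-ins for CLIP text and image embeddings, the function names are hypothetical, and the wording of the second meaning ("a machine for lifting heavy objects") is an assumption. This is an illustration of the rule, not the authors' code.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def resolve_label_ambiguity(meaning_embeddings, image_embeddings):
    """Pick the meaning whose text embedding is, on average, most similar
    to the image embeddings of the class's training examples.

    meaning_embeddings: dict mapping a meaning string to its text embedding.
    image_embeddings: list of image embeddings for the class.
    Returns (best_meaning, scores), where scores maps meaning -> average.
    """
    scores = {}
    for meaning, text_vec in meaning_embeddings.items():
        sims = [cosine(text_vec, img_vec) for img_vec in image_embeddings]
        scores[meaning] = sum(sims) / len(sims)
    best = max(scores, key=scores.get)
    return best, scores


# Made-up embeddings standing in for CLIP outputs for the class "crane".
meanings = {
    "a large long-legged bird": [0.9, 0.1, 0.0],
    "a machine for lifting heavy objects": [0.1, 0.9, 0.1],  # assumed wording
}
train_images = [[0.8, 0.2, 0.1], [0.95, 0.05, 0.0], [0.7, 0.3, 0.0]]

best, scores = resolve_label_ambiguity(meanings, train_images)
print(best)
```

With real data, the embeddings would come from a CLIP text encoder and image encoder. The selected meaning would then be attached to the class name as extra context in later prompts, as the caption describes.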