Augmented Conditioning Is Enough For Effective Training Image Generation

Jiahui Chen, Amy Zhang, Adriana Romero-Soriano

TL;DR

The paper addresses the challenge of generating realistic yet diverse synthetic training data for image classification without fine-tuning diffusion models. It introduces augmentation-conditioned generation, which conditions the diffusion model on a text prompt and an augmented real training image to produce in-domain but varied samples, improving downstream classifier performance on long-tail and few-shot benchmarks. Across five established benchmarks, the approach yields consistent gains, often surpassing state-of-the-art baselines that rely on more synthetic data or diffusion-model fine-tuning, and offers a scalable, cost-effective way to leverage synthetic data for training. The work shows that simple, classical augmentation strategies can be integrated into diffusion-based generation to improve training data quality and diversity.

Abstract

Image generation abilities of text-to-image diffusion models have significantly advanced, yielding highly photo-realistic images from descriptive text and increasing the viability of leveraging synthetic images to train computer vision models. To serve as effective training data, generated images must be highly realistic while also sufficiently diverse within the support of the target data distribution. Yet, state-of-the-art conditional image generation models have been primarily optimized for creative applications, prioritizing image realism and prompt adherence over conditional diversity. In this paper, we investigate how to improve the diversity of generated images with the goal of increasing their effectiveness to train downstream image classification models, without fine-tuning the image generation model. We find that conditioning the generation process on an augmented real image and text prompt produces generations that serve as effective synthetic datasets for downstream training. Conditioning on real training images contextualizes the generation process to produce images that are in-domain with the real image distribution, while data augmentations introduce visual diversity that improves the performance of the downstream classifier. We validate augmentation-conditioning on a total of five established long-tail and few-shot image classification benchmarks and show that leveraging augmentations to condition the generation process results in consistent improvements over the state-of-the-art on the long-tailed benchmark and remarkable gains in extreme few-shot regimes of the remaining four benchmarks. These results constitute an important step towards effectively leveraging synthetic data for downstream training.
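
To make the idea of conditioning on an augmented real image more concrete, the following is a minimal sketch rather than the authors' implementation. It assumes images are small 2D grids of floats, that an image encoder is supplied by the caller, and that the prompt template is "a photo of a <class>". The names `cutmix`, `dropout`, `build_conditioning`, and `encode` are hypothetical, and the exact augmentation definitions used in the paper are not given in this section.

```python
import random


def cutmix(a, b, rng):
    """CutMix sketch: paste a random rectangle taken from b into a copy of a."""
    h, w = len(a), len(a[0])
    lam = rng.random()  # fraction of `a` to keep (assumed uniform in [0, 1))
    cut_h = int(h * (1 - lam) ** 0.5)
    cut_w = int(w * (1 - lam) ** 0.5)
    top = rng.randint(0, h - cut_h)
    left = rng.randint(0, w - cut_w)
    out = [row[:] for row in a]  # copy so the input image is not modified
    for y in range(top, top + cut_h):
        out[y][left:left + cut_w] = b[y][left:left + cut_w]
    return out


def dropout(vec, p, rng):
    """Zero each entry independently with probability p (no rescaling)."""
    return [0.0 if rng.random() < p else v for v in vec]


def build_conditioning(img_a, img_b, class_name, encode, rng, p=0.1):
    """Return (prompt, embedding) to pass to an image generator.

    `encode` is a caller-supplied function mapping an image to a flat
    embedding; it stands in for whatever image encoder the generator uses.
    """
    augmented = cutmix(img_a, img_b, rng)      # augment in pixel space
    embedding = dropout(encode(augmented), p, rng)  # then perturb the embedding
    prompt = f"a photo of a {class_name}"      # assumed prompt template
    return prompt, embedding
```

In this sketch the real images keep the conditioning in-domain, while the random rectangle and the dropped embedding entries vary the conditioning signal from one call to the next, so repeated calls for the same class give the generator different inputs.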

Paper Structure

This paper contains 27 sections, 1 equation, 8 figures, 4 tables.

Figures (8)

  • Figure 1: Example images from (a) real training data, (b) a pretrained diffusion model using the class label as conditioning, (c) the best performing augmentation-conditioned method. Augmentation conditioning generates visually diverse, realistic images that enhance downstream classification accuracy when used as training data.
  • Figure 2: Our augmentation-conditioned generation conditions the reverse diffusion process on the class label and an augmented real image, introducing visual diversity that improves the performance of the downstream classifier.
  • Figure 3: Failed generations. (a), (b) Semantic errors: generations using only the class label depict a totally different object. (c), (d) Visual domain shift: generations using only the class label produce the correct visual concept but in a distinctly different visual style. Both failure cases reduce the efficacy of synthetic training images and are remedied by generating images conditioned on the class label and real training images.
  • Figure 4: Sample generated images from all of the augmentation-conditioning methods. (a) shows generations conditioned on just the image and generations conditioned on Dropout applied to the image. (b) shows generations conditioned on the combination of two images produced by the specified augmentation method. Augmentation-conditioned generations show more visual diversity in the coloration, pose, and angle of the hamster. Generations from Embed-CutMix-Dropout, which yields the highest accuracy on ImageNet-LT, have distinct background diversity, with hamsters depicted in various realistic terrains.
  • Figure 5: Effect of the classifier-free guidance (CFG) scale on few-shot classification performance. Across all datasets, fine-tuning on images generated with a CFG scale of 10.0 yields better performance.
  • ...and 3 more figures
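
For readers unfamiliar with the CFG scale mentioned in the Figure 5 caption, one widely used formulation of classifier-free guidance is given below. This is background under stated assumptions, not the paper's own equation: the symbols $\epsilon_\theta$ (the noise predictor), $x_t$ (the noisy sample at step $t$), $c$ (the conditioning), $\varnothing$ (the empty conditioning), and $s$ (the guidance scale) are introduced here, and the paper may parameterize the scale differently.

```latex
\hat{\epsilon}_\theta(x_t, c)
  = \epsilon_\theta(x_t, \varnothing)
  + s \,\bigl( \epsilon_\theta(x_t, c) - \epsilon_\theta(x_t, \varnothing) \bigr)
```

Under this convention, $s = 1$ recovers the purely conditional prediction, and larger values such as $s = 10$ push generations more strongly toward the conditioning.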