Training on Thin Air: Improve Image Classification with Generated Data

Yongchao Zhou, Hshmat Sahak, Jimmy Ba

TL;DR

This work presents Diffusion Inversion, a two-stage method that converts real images into latent embeddings of Stable Diffusion and then samples diverse, high-quality synthetic images by conditioning on noisy variants of these embeddings. By learning per-image conditioning vectors and using classifier-free guidance, the approach achieves 2-3x improvements in sample efficiency and 6.5x faster sampling, surpassing generic prompt-based methods and KNN retrieval across datasets and architectures. The method emphasizes data distribution coverage and quality, shows strong performance in few-shot and domain-shift scenarios, and proves complementary to standard augmentation techniques. Its practical impact lies in enabling efficient, scalable synthetic data augmentation for discriminative learning, particularly in data-scarce or specialized domains.

Abstract

Acquiring high-quality data for training discriminative models is a crucial yet challenging aspect of building effective predictive systems. In this paper, we present Diffusion Inversion, a simple yet effective method that leverages the pre-trained generative model Stable Diffusion to generate diverse, high-quality training data for image classification. Our approach captures the original data distribution and ensures data coverage by inverting images to the latent space of Stable Diffusion, and it generates diverse novel training images by conditioning the generative model on noisy versions of the resulting embedding vectors. We identify three key components that allow our generated images to successfully supplant the original dataset, leading to a 2-3x enhancement in sample complexity and a 6.5x decrease in sampling time. Moreover, our approach consistently outperforms generic prompt-based steering methods and a KNN retrieval baseline across a wide range of datasets. Additionally, we demonstrate the compatibility of our approach with widely used data augmentation techniques, as well as the reliability of the generated data in supporting various neural architectures and enhancing few-shot learning.
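
The two sampling-side ideas mentioned above, conditioning on noisy versions of a learned embedding and steering with classifier-free guidance, can be illustrated with a minimal sketch. This is not the authors' implementation. The function names, the use of plain Gaussian noise, and the values of `noise_scale` and `guidance_scale` are assumptions made for illustration, and plain Python lists stand in for the tensors a real Stable Diffusion pipeline would use. Only the guidance formula itself (unconditional prediction plus a scaled difference) is the standard one.

```python
import random
from typing import List, Sequence


def perturb_embedding(
    embedding: Sequence[float], noise_scale: float, rng: random.Random
) -> List[float]:
    """Return a noisy copy of one learned conditioning vector.

    Assumption: the perturbation is additive Gaussian noise with standard
    deviation `noise_scale`; the source only says "noisy versions".
    """
    return [x + noise_scale * rng.gauss(0.0, 1.0) for x in embedding]


def classifier_free_guidance(
    eps_uncond: Sequence[float],
    eps_cond: Sequence[float],
    guidance_scale: float,
) -> List[float]:
    """Combine unconditional and conditional noise predictions.

    Standard classifier-free guidance:
        eps = eps_uncond + guidance_scale * (eps_cond - eps_uncond)
    A scale of 1.0 returns the conditional prediction unchanged.
    """
    return [u + guidance_scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


if __name__ == "__main__":
    rng = random.Random(0)

    # Hypothetical stand-in for one per-image embedding learned in stage one.
    learned_embedding = [0.5, -1.0, 2.0]

    # Stage two: draw several perturbed conditions from the same embedding,
    # each of which would condition one synthetic image.
    variants = [perturb_embedding(learned_embedding, 0.1, rng) for _ in range(4)]
    for v in variants:
        print([round(x, 3) for x in v])

    # Hypothetical noise predictions from a frozen denoiser at one step.
    guided = classifier_free_guidance([0.0, 0.0], [1.0, -1.0], guidance_scale=2.0)
    print(guided)
```

In this sketch each perturbed vector plays the role of a distinct condition for the generator, which is how one real image can yield many synthetic variants. A larger `noise_scale` would trade fidelity to the source image for diversity.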

Paper Structure (51 sections, 3 equations, 12 figures, 10 tables)

Figures (12)

  • Figure 1: (Left) Our two-stage approach utilizes Stable Diffusion's generalizable knowledge for targeted classification tasks by transforming real images into latent space and generating novel variants through inverse diffusion with perturbed embeddings. (Right) The test accuracy of ResNet18 increases as more generated data is incorporated, eventually exceeding the performance of the model trained on the entire real dataset.
  • Figure 2: Our method optimizes the standard denoising objective to learn a set of embedding vectors while keeping the model parameters fixed.
  • Figure 3: Synthetic images produced by our method exhibit diversity, realism, and comprehensive coverage of the original dataset, making them a suitable substitute for it.
  • Figure 4: Despite the overhead incurred by embedding learning, our method substantially decreases the overall time required to generate numerous images due to improved sampling.
  • Figure 5: Our method outperforms both GAN and GAN Inversion techniques when trained on datasets of equivalent size to the original real dataset, highlighting the significance of a high-quality pre-trained generator.
  • ...and 7 more figures