Dataset Augmentation by Mixing Visual Concepts

Abdullah Al Rahat, Hemanth Venkateswara

TL;DR

The paper tackles domain mismatch in diffusion-based data augmentation by introducing Mixing Visual Concepts (MVC), which mixes CLIP-derived caption embeddings to condition a fine-tuned Stable Diffusion model on both text and in-domain image embeddings. The method keeps generated samples within the target domain and yields diverse samples without drifting from the real data distribution. Experiments on CIFAR-10/100, Tiny ImageNet, Caltech101, and medical imaging show consistent accuracy gains over AutoAugment and RandAugment, with two-phase training providing additional benefits. The work demonstrates a practical approach to domain-aligned augmentation that improves generalization on both broad and specialized tasks.

Abstract

This paper proposes a dataset augmentation method based on fine-tuning pre-trained diffusion models. Generating images using a pre-trained diffusion model with textual conditioning often results in domain discrepancy between real data and generated images. We propose a fine-tuning approach where we adapt the diffusion model by conditioning it on real images and novel text embeddings. We introduce a procedure called Mixing Visual Concepts (MVC), in which we create novel text embeddings from image captions. MVC enables us to generate multiple images that are diverse yet similar to the real data, which makes effective dataset augmentation possible. We perform comprehensive qualitative and quantitative evaluations of the proposed dataset augmentation approach, showcasing both coarse-grained and fine-grained changes in generated images. Our approach outperforms state-of-the-art augmentation techniques on benchmark classification tasks.

Paper Structure

This paper contains 13 sections, 6 equations, 6 figures, 5 tables, 1 algorithm.

Figures (6)

  • Figure 1: Naïvely deploying a pre-trained generative model to generate new images for dataset augmentation can lead to domain discrepancy. Columns 1 and 2 are MRI scans from the Brain Tumor Dataset 1jny-g144-23, Columns 3 and 4 are images generated using a pre-trained Stable Diffusion (SD) model, and Columns 5 and 6 are images generated using the proposed MVC method.
  • Figure 2: The results of coarse and fine mixing. (a) real image, (b) generated images conditioned on the real image along with coarse mixing, (c) generated images conditioned on the real image along with only fine mixing.
  • Figure 3: The training procedure illustrated with an image of a Cat: We fine-tune the pre-trained Stable Diffusion (SD U-Net) model. The input 'Cat' image is used to generate the noisy latent $z_t$. The 'Conditional Image' is used to generate the image conditioning $e_{\mathcal{I}}$, which is concatenated with $z_t$. The input image caption, generated by BLIP-2, is concatenated with a user-provided prompt, e.g., "a photo of a Cat". The captions of all 'Cat' images are stored in $\mathcal{C}$ and are used to generate the text conditioning $e_{\mathcal{T}}$ with the MVC algorithm. The SD U-Net is trained using the objective in Equation \ref{Eq:LDM}.
  • Figure 4: An overview of image generation: We begin with a fully noisy latent $z_T \sim \mathcal{N}(0,I)$. To generate an 'Ewer'-like image, we use the Conditional Image to generate an image embedding $e_{\mathcal{I}}$ and concatenate it with $z_T$. We apply the MVC algorithm to the pool of image captions of 'Ewer' to obtain the text conditioning $e_{\mathcal{T}}$. We apply the denoising procedure in Equation \ref{Eq:denoise} to estimate $z_{T-1}$ from $z_T$. We iterate this procedure for $T \geq t \geq 1$ to arrive at $z_0$, the denoised latent. We apply the decoder to obtain the new image $\hat{x} \gets D(z_0)$.
  • Figure 5: Column 1 depicts real images from CIFAR-100. In the remaining columns, the first two rows are images generated using the proposed MVC method, which constrains generated images to be in-domain. The third row represents images generated using a pre-trained SD model, which provides more diversity with no control over in-domain generation.
  • ...and 1 more figures
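
The generation loop described in the Figure 4 caption can be sketched roughly as follows. This is a toy illustration, not the authors' implementation. The exact MVC mixing rule and the denoising update (Equation \ref{Eq:denoise}) are not reproduced on this page, so `mix_caption_embeddings` is a hypothetical stand-in that forms a random convex combination of caption embeddings from the class pool, and `denoise_step` is a caller-supplied placeholder for the conditioned SD U-Net update. Sampling the text conditioning once per image, rather than once per step, is also an assumption. Embeddings and latents are plain Python lists, and the decoder $D$ is omitted.

```python
import random


def mix_caption_embeddings(pool, k=2, rng=None):
    """Hypothetical stand-in for MVC: a random convex combination of k
    caption embeddings drawn from one class's caption pool."""
    rng = rng or random.Random()
    chosen = rng.sample(pool, k)
    raw = [rng.random() + 1e-9 for _ in chosen]
    total = sum(raw)
    weights = [w / total for w in raw]  # non-negative, sums to 1
    dim = len(chosen[0])
    return [sum(w * e[i] for w, e in zip(weights, chosen)) for i in range(dim)]


def generate_latent(denoise_step, e_img, caption_pool, latent_dim, T, rng=None):
    """Iterate from z_T to z_0 as outlined in the Figure 4 caption.

    denoise_step(z_concat, e_txt, t) is a placeholder for the conditioned
    U-Net update; it must return a latent of length latent_dim.
    """
    rng = rng or random.Random()
    z = [rng.gauss(0.0, 1.0) for _ in range(latent_dim)]  # z_T ~ N(0, I)
    # Assumption: the text conditioning is mixed once per generated image.
    e_txt = mix_caption_embeddings(caption_pool, rng=rng)
    for t in range(T, 0, -1):  # T >= t >= 1
        # List concatenation mirrors "e_I is concatenated with z_t".
        z = denoise_step(z + e_img, e_txt, t)
    return z  # z_0; a real pipeline would decode it with D(z_0)


if __name__ == "__main__":
    pool = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # toy caption embeddings
    toy_step = lambda zc, e_txt, t: [0.9 * v for v in zc[:4]]  # toy denoiser
    z0 = generate_latent(toy_step, [0.0] * 3, pool, latent_dim=4, T=10)
    print(len(z0))
```

Because the mixing weights are resampled on every call, repeated calls with the same conditional image and caption pool yield different text conditionings, which is one plausible reading of how MVC produces multiple diverse but in-domain images per real image.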