Dataset Augmentation by Mixing Visual Concepts
Abdullah Al Rahat, Hemanth Venkateswara
TL;DR
The paper tackles domain mismatch in diffusion-based data augmentation by introducing Mixing Visual Concepts (MVC), which mixes CLIP-derived caption embeddings so that a fine-tuned Stable Diffusion model is conditioned on both text and in-domain image embeddings. The method keeps generated samples within the target domain and allows high diversity without drifting from the real data distribution. Experiments on CIFAR-10/100, Tiny ImageNet, Caltech101, and medical imaging show consistent accuracy gains over AutoAugment and RandAugment, with two-phase training providing additional benefits. The work demonstrates a practical approach to domain-aligned augmentation that improves generalization on both broad and specialized tasks.
Abstract
This paper proposes a dataset augmentation method based on fine-tuning pre-trained diffusion models. Generating images with a pre-trained diffusion model under textual conditioning often results in a domain discrepancy between the real data and the generated images. We propose a fine-tuning approach in which we adapt the diffusion model by conditioning it on real images and novel text embeddings. We introduce a procedure called Mixing Visual Concepts (MVC), in which we create novel text embeddings from image captions. MVC enables us to generate multiple images that are diverse yet similar to the real data, which makes effective dataset augmentation possible. We perform comprehensive qualitative and quantitative evaluations of the proposed approach, showcasing both coarse-grained and fine-grained changes in the generated images. Our approach outperforms state-of-the-art augmentation techniques on benchmark classification tasks.
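As an illustration of what creating novel text embeddings from image captions could look like, the sketch below mixes pairs of caption embeddings by convex combination. This is an assumption made for illustration only: the abstract does not state the mixing rule MVC uses, and the function names `mix_embeddings` and `sample_mixed_embeddings`, the mixing weight `lam`, and the toy 4-dimensional vectors are all hypothetical stand-ins for real caption embeddings.

```python
import random


def mix_embeddings(emb_a, emb_b, lam):
    """Convex combination of two equal-length embedding vectors.

    Assumption: linear interpolation is used here only as a simple,
    well-known mixing rule; it is not taken from the paper.
    """
    if len(emb_a) != len(emb_b):
        raise ValueError("embeddings must have the same length")
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    return [lam * a + (1.0 - lam) * b for a, b in zip(emb_a, emb_b)]


def sample_mixed_embeddings(caption_embs, n_samples, seed=0):
    """Draw random pairs of caption embeddings and mix each pair."""
    rng = random.Random(seed)
    mixed = []
    for _ in range(n_samples):
        emb_a, emb_b = rng.sample(caption_embs, 2)  # two distinct captions
        lam = rng.random()                          # mixing weight in [0, 1)
        mixed.append(mix_embeddings(emb_a, emb_b, lam))
    return mixed


# Toy stand-ins for caption embeddings (hypothetical values).
caption_embs = [
    [1.0, 0.0, 0.0, 0.0],
    [0.0, 1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0, 0.0],
]
novel = sample_mixed_embeddings(caption_embs, n_samples=5)
```

In a full pipeline, each mixed vector would stand in for a caption embedding when conditioning the fine-tuned diffusion model, so that several distinct images can be generated per real example; that step is omitted here because it depends on details the abstract does not give.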
