Context-guided Responsible Data Augmentation with Diffusion Models

Khawar Islam, Naveed Akhtar

TL;DR

This work addresses the challenge of unreliable content in diffusion-model-based data augmentation for vision tasks. It proposes DiffCoRe-Mix, a context-guided text-to-image diffusion augmentation framework that uses contextual and negative prompts to steer generation, and a CLIP-based hard-cosine filtration to discard generated samples that are not semantically aligned with the real images. The retained samples are mixed with real images at the pixel and patch levels to improve generalization, yielding gains of up to $\sim 3\%$ absolute top-1 accuracy over state-of-the-art methods across the seven evaluated datasets, at comparable computational overhead.

Abstract

Generative diffusion models offer a natural choice for data augmentation when training complex vision models. However, ensuring the reliability of their generative content as augmentation samples remains an open challenge. Despite a number of techniques utilizing generative images to strengthen model training, it remains unclear how to utilize the combination of natural and generative images as a rich supervisory signal for effective model induction. In this regard, we propose a text-to-image (T2I) data augmentation method, named DiffCoRe-Mix, that computes a set of generative counterparts for a training sample with an explicitly constrained diffusion model that leverages sample-based context and negative prompting for reliable augmentation sample generation. To preserve key semantic axes, we also filter out undesired generative samples in our augmentation process. To that end, we propose a hard-cosine filtration in the embedding space of CLIP. Our approach systematically mixes the natural and generative images at pixel and patch levels. We extensively evaluate our technique on ImageNet-1K, Tiny ImageNet-200, CIFAR-100, Flowers102, CUB-Birds, Stanford Cars, and Caltech datasets, demonstrating a notable increase in performance across the board, achieving up to $\sim 3\%$ absolute gain in top-1 accuracy over the state-of-the-art methods, while showing comparable computational overhead. Our code is publicly available at https://github.com/khawar-islam/DiffCoRe-Mix
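
To make the hard-cosine filtration step more concrete, the following is a minimal sketch of a threshold-based accept/reject rule on embedding vectors. It is an illustration under stated assumptions, not the authors' implementation: the function names, the plain-list embedding format, and the default threshold of 0.75 are hypothetical, and in practice the vectors would come from a CLIP image encoder rather than being written by hand.

```python
import math


def cosine_similarity(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0  # treat a degenerate (all-zero) embedding as unaligned
    return dot / (norm_u * norm_v)


def hard_cosine_filter(real_embedding, generated_embeddings, threshold=0.75):
    """Return the indices of generated samples to keep.

    A sample is kept only if its cosine similarity to the real image's
    embedding is at or above `threshold` (a hard cut, no soft weighting).
    The threshold value is an assumption, not taken from the paper.
    """
    return [
        i
        for i, g in enumerate(generated_embeddings)
        if cosine_similarity(real_embedding, g) >= threshold
    ]


if __name__ == "__main__":
    # Toy 2-D "embeddings" standing in for CLIP image features.
    real = [1.0, 0.0]
    generated = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    print(hard_cosine_filter(real, generated))
```

The rule is deliberately binary: a generated sample is either passed on to the mixing stage or dropped, which matches the "filter out" wording of the abstract. How the threshold is chosen is not stated in this excerpt.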

Paper Structure

This paper contains 12 sections, 11 equations, 15 figures, 8 tables.

Figures (15)

  • Figure 1: (Top) The proposed DiffCoRe-Mix employs a T2I model constrained with contextual and negative prompts. The output of the T2I model is filtered, and image mixing is employed to improve generalization and robustness. (Bottom) The closest generative image-mixing method, DiffuseMix (Islam et al., 2024), uses an I2I model with a style prompt to edit the image and concatenates the original and generative images.
  • Figure 2: Overview of the DiffCoRe-Mix data augmentation method. It takes an input image from the dataset and generates an image guided by our contextual and negative prompts. A CLIP-based image encoder extracts features from the original and generative images. Our hard-cosine filtration approach then verifies the semantic alignment between the original and generative features. We filter out unaligned images and mix the real and generative images at the pixel and patch levels.
  • Figure 3: Representative context-guided generative images. Despite strong (positive and negative) context guidance, the generated images may include a small fraction ($\sim 10\%$, as confirmed by the results in § \ref{sec:FAD}) of samples that do not align well semantically with the original images.
  • Figure 4: Augmentation overhead (+%) vs. accuracy (%) on the CUB dataset with batch size 16. The closer a point is to the upper-left corner, the better the augmentation strategy.
  • Figure 5: Representative saliency visualizations on original data samples. Our method guides the model to more precisely focus on the target object in the image.
  • ...and 10 more figures
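
The pixel- and patch-level mixing mentioned in the Figure 2 caption can be illustrated with a small sketch. The exact mixing formulas are not given in this excerpt, so the code below assumes a Mixup-style convex blend for the pixel level and a CutMix-style rectangular paste for the patch level. The function names, the `lam` blending weight, and the representation of images as nested lists of grayscale values are all hypothetical choices made for illustration.

```python
def pixel_mix(real, generated, lam=0.5):
    """Blend two same-sized images pixel by pixel.

    Each output pixel is lam * real + (1 - lam) * generated.
    This convex blend is an assumed form, not the paper's stated formula.
    """
    return [
        [lam * r + (1.0 - lam) * g for r, g in zip(row_r, row_g)]
        for row_r, row_g in zip(real, generated)
    ]


def patch_mix(real, generated, top, left, height, width):
    """Paste a rectangular patch of the generated image into the real image.

    Returns a new image; the inputs are left unmodified. The patch is
    clipped to the image bounds.
    """
    mixed = [row[:] for row in real]  # copy each row so `real` is untouched
    for y in range(top, min(top + height, len(mixed))):
        for x in range(left, min(left + width, len(mixed[y]))):
            mixed[y][x] = generated[y][x]
    return mixed


if __name__ == "__main__":
    # 4x4 toy images: an all-black "real" image and an all-white "generated" one.
    real = [[0.0] * 4 for _ in range(4)]
    generated = [[1.0] * 4 for _ in range(4)]
    print(pixel_mix(real, generated, lam=0.25)[0])
    for row in patch_mix(real, generated, top=0, left=2, height=4, width=2):
        print(row)
```

In this toy run the pixel-level result is a uniform blend of the two images, while the patch-level result keeps the left half of the real image and takes the right half from the generated one. Only generated samples that survive the filtration step would be passed to either function.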