DIAGen: Semantically Diverse Image Augmentation with Generative Models for Few-Shot Learning

Tobias Lingenberg, Markus Reuter, Gopika Sudhakaran, Dominik Gojny, Stefan Roth, Simone Schaub-Meyer

TL;DR

DIAGen targets the semantic diversity gap in standard augmentations for few-shot learning by extending the DA-Fusion pipeline with three components: embedding-space Gaussian noise on learned class embeddings, GPT-4–driven prompts to diversify textual guidance, and a weighting mechanism to mitigate low-fidelity samples. It achieves higher downstream accuracy and recall across four datasets, with notable gains in out-of-distribution and uncommon settings, demonstrating stronger generalization under data scarcity. The approach leverages multi-modal knowledge from diffusion models and LLMs to produce semantically varied yet high-quality synthetic images, making it practical for real-world few-shot applications. Overall, DIAGen offers a scalable, off-the-shelf augmentation that improves robustness in edge cases while balancing fidelity and diversity, with future work to broaden tasks and address model-exposure limitations.

Abstract

Simple data augmentation techniques, such as rotations and flips, are widely used to enhance the generalization power of computer vision models. However, these techniques often fail to modify high-level semantic attributes of a class. To address this limitation, researchers have explored generative augmentation methods like the recently proposed DA-Fusion. Despite some progress, the variations are still largely limited to textural changes, thus falling short on aspects like varied viewpoints, environment, weather conditions, or even class-level semantic attributes (e.g., variations in a dog's breed). To overcome this challenge, we propose DIAGen, building upon DA-Fusion. First, we apply Gaussian noise to the embeddings of an object learned with Textual Inversion to diversify generations using a pre-trained diffusion model's knowledge. Second, we exploit the general knowledge of a text-to-text generative model to guide the image generation of the diffusion model with varied class-specific prompts. Finally, we introduce a weighting mechanism to mitigate the impact of poorly generated samples. Experimental results across various datasets show that DIAGen not only enhances semantic diversity but also improves the performance of subsequent classifiers. The advantages of DIAGen over standard augmentations and the DA-Fusion baseline are particularly pronounced with out-of-distribution samples.
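The first component, Gaussian noise applied to a learned class embedding, can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `perturb_embedding`, the embedding dimension of 768, and the noise scale `sigma` are assumptions chosen for illustration, and the learned Textual Inversion embedding is replaced by a random placeholder vector.

```python
import numpy as np

def perturb_embedding(embedding, sigma, num_variants, seed=0):
    """Return `num_variants` noisy copies of a 1-D class embedding.

    Each copy adds independent zero-mean Gaussian noise with standard
    deviation `sigma` (a hypothetical hyperparameter) to every dimension.
    """
    rng = np.random.default_rng(seed)
    noise = rng.normal(loc=0.0, scale=sigma,
                       size=(num_variants, embedding.shape[0]))
    # Broadcast the single embedding across all noisy variants.
    return embedding[None, :] + noise

# Placeholder for an embedding learned with Textual Inversion (assumed 768-d).
class_embedding = np.random.default_rng(42).normal(size=768)
variants = perturb_embedding(class_embedding, sigma=0.025, num_variants=4)
print(variants.shape)
```

Each row of `variants` would then stand in for the class token when conditioning the diffusion model, so that different generations start from slightly different notions of the class. A larger `sigma` trades fidelity for diversity, which is the tension the weighting mechanism is meant to manage.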

Paper Structure (22 sections, 5 equations, 10 figures, 1 table)

Figures (10)

  • Figure 1: Comparison of augmentation results between the baseline method DA-Fusion [trabucco2023effective] (left) and our proposed approach DIAGen (right), utilizing the same guiding image (middle) for the augmentation process. DIAGen demonstrates superior, semantically diverse image augmentations, as evidenced through more variation of object appearance and settings. This observation is supported by improvements in classification accuracy and recall as a diversity metric [kynkaanniemi2019improved].
  • Figure 2: DIAGen's image generation pipeline based on DA-Fusion [trabucco2023effective]. Our contributions include: a) Varying the learned class concept in the embedding space by applying Gaussian noise. b) Using varied class-specific prompts generated by an LLM. c) Training and utilizing a classifier trained on real images as a weighting mechanism. All real images combined with the generated synthetic ones are then used to train an arbitrary downstream model. The ratio of real to synthetic images can be controlled by the synthetic probability hyperparameter $\alpha$.
  • Figure 3: Downstream classification accuracy of DIAGen, DA-Fusion [trabucco2023effective], and standard augmentations on four datasets: (a) FOCUS, (b) MS COCO, (c) Custom COCO dataset, and (d) training on Custom COCO with evaluation on Uncommon Settings test set. Runs marked with * are taken from Trabucco et al. [trabucco2023effective].
  • Figure 4: Ablation study of the three proposed components, showcasing their distinct contributions to the classification accuracy. We illustrate the accuracy gains over DA-Fusion [trabucco2023effective] when solely utilizing embedding noise (top left), employing only the LLM prompt module (top right), and combining both (bottom left). Also shown are the improvements from adding the weighting mechanism (bottom right). The latter corresponds to the full DIAGen method.
  • Figure 5: Direct comparison of downstream classifier accuracy between DIAGen (ours) and the baseline method DA-Fusion (our parameters), using the same hyperparameters for a fair evaluation. For reference, the original DA-Fusion method with its parameters from Trabucco et al. [trabucco2023effective] is also included.
  • ...and 5 more figures
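
The Figure 2 caption mentions a synthetic probability $\alpha$ and a classifier-based weighting mechanism. One way these two pieces could interact during training is sketched below. The details are assumptions, not the paper's confirmed procedure: the function name `pick_training_image` is hypothetical, and the weights are assumed to be per-image scores (for example, the classifier's confidence in the intended class) used as sampling probabilities.

```python
import random

def pick_training_image(real_image, synthetic_images, weights, alpha, rng):
    """Choose the image used for one training step.

    With probability `alpha`, return a synthetic image drawn in proportion
    to `weights` (assumed here to be classifier confidence scores);
    otherwise return the real image.
    """
    if synthetic_images and rng.random() < alpha:
        # Low-weight (likely poorly generated) samples are drawn less often.
        return rng.choices(synthetic_images, weights=weights, k=1)[0]
    return real_image

rng = random.Random(0)
synthetic = ["syn_a.png", "syn_b.png", "syn_c.png"]  # placeholder file names
confidence = [0.9, 0.1, 0.6]                         # hypothetical scores
chosen = pick_training_image("real.png", synthetic, confidence, alpha=0.5, rng=rng)
print(chosen)
```

Under this sketch, setting `alpha` to 0 trains on real images only, while values closer to 1 rely increasingly on synthetic data.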