DALDA: Data Augmentation Leveraging Diffusion Model and LLM with Adaptive Guidance Scaling

Kyuheon Jung, Yongdeuk Seo, Seongwoo Cho, Jaeyoung Kim, Hyun-seok Min, Sungchul Choi

TL;DR

DALDA tackles data scarcity by combining two components: an LLM that enriches text prompts with class-specific semantic information, and diffusion-based image synthesis that uses real images as visual prompts. CLIPScore is used to adaptively balance the image- and text-driven cues. The core contribution is Adaptive Guidance Scaling (AGS), implemented via IP-Adapter cross-attention, which sets the balance between text and image emphasis by sampling the guidance weight $\lambda$ from a truncated normal distribution chosen according to the per-sample CLIPScore. Empirical results on High CLIPScore (HC) and Low CLIPScore (LC) few-shot benchmarks show increased synthetic-data diversity and improved downstream accuracy, with the strongest gains when LLM-generated prompts and AGS are used together, while adherence to the target distribution is maintained in challenging low-CLIPScore cases. The method requires no additional fine-tuning of the diffusion model and provides a principled mechanism for trading off diversity against distributional fidelity, which makes it a scalable option for real-world, data-scarce tasks.

Abstract

In this paper, we present an effective data augmentation framework leveraging the Large Language Model (LLM) and Diffusion Model (DM) to tackle the challenges inherent in data-scarce scenarios. Recently, DMs have opened up the possibility of generating synthetic images to complement a few training images. However, increasing the diversity of synthetic images also raises the risk of generating samples outside the target distribution. Our approach addresses this issue by embedding novel semantic information into text prompts via LLM and utilizing real images as visual prompts, thus generating semantically rich images. To ensure that the generated images remain within the target distribution, we dynamically adjust the guidance weight based on each image's CLIPScore to control the diversity. Experimental results show that our method produces synthetic images with enhanced diversity while maintaining adherence to the target distribution. Consequently, our approach proves to be more efficient in the few-shot setting on several benchmarks. Our code is available at https://github.com/kkyuhun94/dalda .
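
The adaptive adjustment of the guidance weight described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the threshold, the two means, and the standard deviation are hypothetical values chosen only for the example, and the function names are invented here. The only properties taken from the text are that $\lambda$ is drawn from a truncated normal distribution, that low-CLIPScore images lean on the image prompt (larger $\lambda$), and that high-CLIPScore images lean on the text prompt (smaller $\lambda$).

```python
import random


def sample_truncated_normal(mean, std, low, high, rng):
    """Draw from N(mean, std^2) restricted to [low, high] by rejection sampling."""
    while True:
        x = rng.gauss(mean, std)
        if low <= x <= high:
            return x


def sample_guidance_weight(clip_score, threshold, rng,
                           low_score_mean=0.7, high_score_mean=0.3, std=0.1):
    """Pick the image-prompt weight lambda in [0, 1] for one training image.

    ASSUMPTION: threshold, both means, and std are illustrative placeholders,
    not values reported by the paper.
    """
    if clip_score < threshold:
        # Low CLIPScore: stay close to the real image (larger lambda).
        mean = low_score_mean
    else:
        # High CLIPScore: let the text prompt drive diversity (smaller lambda).
        mean = high_score_mean
    return sample_truncated_normal(mean, std, 0.0, 1.0, rng)


rng = random.Random(0)
lam_low = sample_guidance_weight(clip_score=0.5, threshold=0.7, rng=rng)
lam_high = sample_guidance_weight(clip_score=0.9, threshold=0.7, rng=rng)
print(f"low-score image: lambda={lam_low:.2f}, high-score image: lambda={lam_high:.2f}")
```

Sampling $\lambda$ rather than fixing it per group means that repeated generations from the same training image still vary in how strongly they follow the image prompt.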

Paper Structure

This paper contains 18 sections, 3 equations, 10 figures, and 5 tables.

Figures (10)

  • Figure 1: Overview of the proposed framework. We calculate the CLIPScore for each training image. We then adaptively adjust the prompt weight $\lambda$: for samples with a low CLIPScore, $\lambda$ places the focus on the image guidance, whereas for samples with a high CLIPScore the text guidance is weighted more heavily to obtain synthetic data with increased diversity. Lastly, the text prompts generated by the LLM and the image prompts are fed into the MMDM to generate synthetic images.
  • Figure 2: Examples of synthetic images generated with varying $\lambda$. As $\lambda$ approaches zero, the output becomes more similar to text-to-image (T2I) generation.
  • Figure 3: Example of text prompts generated by the LLM.
  • Figure 4: CLIPScore distribution of datasets. Oxford Pets and Caltech-101 belong to the High CLIPScore (HC) group, with images from each class showing high CLIPScores. In contrast, Flowers102 has a higher proportion of classes with low CLIPScores, placing it in the Low CLIPScore (LC) group.
  • Figure 5: Synthetic image comparisons. Base Prompt refers to using the CLIP template, "a photo of a {class}", and LLM Prompt refers to using prompts generated by the LLM. In the case of a low CLIPScore example, a Base Prompt might fail to maintain class consistency (red box), resulting in the generation of synthetic images that do not accurately represent the intended class.
  • ...and 5 more figures
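
The per-image scoring step mentioned in the Figure 1 caption can be sketched as below. This is an illustrative sketch under stated assumptions: the embeddings are taken as plain lists of floats that some CLIP image and text encoder has already produced (the encoder call itself is not shown), the function names are hypothetical, and the rescaling factor `w = 2.5` follows the original CLIPScore definition; whether DALDA applies the same scaling is an assumption. The prompt template "a photo of a {class}" is the Base Prompt named in the Figure 5 caption.

```python
import math


def base_prompt(class_name):
    """CLIP-style template named in the Figure 5 caption."""
    return f"a photo of a {class_name}"


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_u * norm_v)


def clip_score(image_emb, text_emb, w=2.5):
    """CLIPScore = w * max(cos(image_emb, text_emb), 0).

    ASSUMPTION: image_emb and text_emb come from a CLIP encoder that is
    not part of this sketch.
    """
    return w * max(cosine_similarity(image_emb, text_emb), 0.0)


# Toy 2-D vectors standing in for real CLIP embeddings.
print(base_prompt("Abyssinian cat"))
print(clip_score([1.0, 0.0], [1.0, 0.0]))   # identical directions
print(clip_score([1.0, 0.0], [0.0, 1.0]))   # orthogonal directions
```

A class whose images score low against their own prompt is one that the text encoder describes poorly, which is the case where the captions above report that text-only guidance can lose class consistency.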