Diffusion-Based Data Augmentation for Image Recognition: A Systematic Analysis and Evaluation

Zekun Li, Yinghuan Shi, Yang Gao, Dong Xu

TL;DR

This paper introduces UniDiffDA, a unified analytical framework that decomposes DiffDA methods into three core components: model fine-tuning, sample generation, and sample utilization. Building on this framework, it develops a comprehensive and fair evaluation protocol and benchmarks representative DiffDA methods across diverse low-data classification tasks.

Abstract

Diffusion-based data augmentation (DiffDA) has emerged as a promising approach to improving classification performance under data scarcity. However, existing works vary significantly in task configurations, model choices, and experimental pipelines, making it difficult to fairly compare methods or assess their effectiveness across different scenarios. Moreover, there remains a lack of systematic understanding of the full DiffDA workflow. In this work, we introduce UniDiffDA, a unified analytical framework that decomposes DiffDA methods into three core components: model fine-tuning, sample generation, and sample utilization. This perspective enables us to identify key differences among existing methods and clarify the overall design space. Building on this framework, we develop a comprehensive and fair evaluation protocol, benchmarking representative DiffDA methods across diverse low-data classification tasks. Extensive experiments reveal the relative strengths and limitations of different DiffDA strategies and offer practical insights into method design and deployment. All methods are re-implemented within a unified codebase, with full release of code and configurations to ensure reproducibility and to facilitate future research.

Paper Structure (23 sections, 5 equations, 16 figures, 19 tables)

Figures (16)

  • Figure 1: Illustration of three diffusion-based image-to-image transition techniques used in representative DiffDA methods: (a) SDEdit, (b) InstructPix2Pix, and (c) DDIM Inversion and Interpolation.
  • Figure 2: Illustration of the fine-tuning pipeline combining Textual Inversion and DreamBooth-LoRA. A pseudo-token (e.g., <Sage_Thrasher>) is learned and inserted into the text prompt to synthesize class-specific images. The corresponding embedding is optimized while keeping the rest of the text encoder frozen. Meanwhile, LoRA modules are inserted into the UNet and fine-tuned to better capture target-domain visual features. The generated image is trained to match the real target image via reconstruction loss. Tunable parameters are highlighted in orange.
  • Figure 3: Illustration of the UniDiffDA framework, which decomposes the diffusion-based data augmentation (DiffDA) workflow into three core components: (1) Model Fine-tuning, where the diffusion model is optionally adapted to the target domain using real images; (2) Sample Generation, where the model synthesizes new samples guided by real data; and (3) Sample Utilization, where synthetic samples are either concatenated with or used to replace real samples for classifier training.
  • Figure 4: Examples of synthetic "Beaver" images from CIFAR100 and "Wild Boar" images from ImageNet-100.
  • Figure 5: Examples of fine-grained concept generation with untuned diffusion models.
  • ...and 11 more figures
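
The Sample Utilization step described in the Figure 3 caption (synthetic samples either concatenated with real samples or used to replace them) can be illustrated with a minimal sketch. This is not the paper's code: the function name `utilize_samples`, the `(image_id, label)` sample representation, and the per-sample replacement probability `replace_prob` are all assumptions introduced here for illustration.

```python
import random
from typing import Dict, List, Sequence, Tuple

# Hypothetical sample representation: (image identifier, class label).
Sample = Tuple[str, int]


def utilize_samples(
    real: Sequence[Sample],
    synthetic: Sequence[Sample],
    mode: str = "concat",
    replace_prob: float = 0.5,  # assumed knob, only used when mode == "replace"
    seed: int = 0,
) -> List[Sample]:
    """Build a classifier training set from real and synthetic samples."""
    if mode == "concat":
        # Concatenation: train on all real samples plus all synthetic ones.
        return list(real) + list(synthetic)

    if mode == "replace":
        # Replacement (one possible reading): each real sample is swapped,
        # with probability replace_prob, for a synthetic sample of the same class.
        rng = random.Random(seed)
        by_label: Dict[int, List[Sample]] = {}
        for sample in synthetic:
            by_label.setdefault(sample[1], []).append(sample)

        training_set: List[Sample] = []
        for sample in real:
            candidates = by_label.get(sample[1], [])
            if candidates and rng.random() < replace_prob:
                training_set.append(rng.choice(candidates))
            else:
                training_set.append(sample)
        return training_set

    raise ValueError(f"unknown mode: {mode!r}")


real = [("real_0", 0), ("real_1", 0), ("real_2", 1)]
synthetic = [("syn_0", 0), ("syn_1", 1), ("syn_2", 1)]

concat_set = utilize_samples(real, synthetic, mode="concat")
replace_set = utilize_samples(real, synthetic, mode="replace", replace_prob=0.5)
```

In this sketch, concatenation grows the training set to the combined size of both sources, while replacement keeps it the same size as the real set and preserves the label of every position. Which strategy works better in practice is an empirical question of the kind the benchmark evaluates.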