Distribution-Aware Data Expansion with Diffusion Models

Haowei Zhu, Ling Yang, Jun-Hai Yong, Hongzhi Yin, Jiawei Jiang, Meng Xiao, Wentao Zhang, Bin Wang

TL;DR

DistDiff tackles data scarcity by enabling distribution-consistent data expansion without re-training diffusion models. It constructs class-level prototypes ${\bm{p}}_c$ and group-level prototypes ${\bm{p}}_g$ to approximate the real data distribution, then applies energy guidance during diffusion sampling, acting at intermediate sampling steps through ${\bm{z}}_{0|t}$ to refine the generated samples. The method improves downstream accuracy across six datasets and outperforms both transformation- and synthesis-based baselines. It is also compatible with standard augmentation pipelines and robust across architectures. By reducing distribution drift without requiring retraining, DistDiff offers a practical, scalable approach to data augmentation for diverse domains.

Abstract

The scale and quality of a dataset significantly impact the performance of deep models. However, acquiring large-scale annotated datasets is both costly and time-consuming. To address this challenge, dataset expansion technologies aim to automatically augment datasets, unlocking the full potential of deep models. Current data expansion techniques include image transformation and image synthesis methods. Transformation-based methods introduce only local variations, leading to limited diversity. In contrast, synthesis-based methods generate entirely new content, greatly enhancing informativeness. However, existing synthesis methods carry the risk of distribution deviations, potentially degrading model performance with out-of-distribution samples. In this paper, we propose DistDiff, a training-free data expansion framework based on the distribution-aware diffusion model. DistDiff constructs hierarchical prototypes to approximate the real data distribution, optimizing latent data points within diffusion models with hierarchical energy guidance. We demonstrate its capability to generate distribution-consistent samples, significantly improving data expansion tasks. DistDiff consistently enhances accuracy across a diverse range of datasets compared to models trained solely on original data. Furthermore, our approach consistently outperforms existing synthesis-based techniques and demonstrates compatibility with widely adopted transformation-based augmentation methods. Additionally, the expanded dataset exhibits robustness across various architectural frameworks. Our code is available at https://github.com/haoweiz23/DistDiff.
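
As a rough illustration of the two ideas named in the abstract (hierarchical prototypes and energy guidance), the toy sketch below works on plain feature vectors with NumPy. It is not the authors' implementation: the function names, the use of k-means for group-level prototypes, the squared-distance energy, and the step size are all assumptions made for illustration. The actual method applies guidance inside diffusion sampling, which this sketch leaves out entirely.

```python
import numpy as np

def class_prototype(features):
    """Class-level prototype: the mean feature vector of one class."""
    return features.mean(axis=0)

def group_prototypes(features, num_groups, num_iters=10, seed=0):
    """Group-level prototypes as k-means centroids (clustering choice is an assumption)."""
    rng = np.random.default_rng(seed)
    start = rng.choice(len(features), size=num_groups, replace=False)
    centroids = features[start]  # fancy indexing returns a copy
    for _ in range(num_iters):
        # Squared distance from every feature to every centroid, shape (N, K).
        dists = ((features[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        for k in range(num_groups):
            members = features[assign == k]
            if len(members) > 0:  # keep the old centroid if a cluster is empty
                centroids[k] = members.mean(axis=0)
    return centroids

def energy(z, p_c, p_g):
    """Hypothetical energy: distance to the class prototype plus the nearest group prototype."""
    e_class = np.sum((z - p_c) ** 2)
    e_group = np.min(np.sum((p_g - z) ** 2, axis=1))
    return e_class + e_group

def guided_step(z, p_c, p_g, step_size=0.1):
    """One gradient-descent step on the energy above, pulling z toward the prototypes."""
    nearest = p_g[np.argmin(np.sum((p_g - z) ** 2, axis=1))]
    grad = 2.0 * (z - p_c) + 2.0 * (z - nearest)
    return z - step_size * grad

# Toy usage: random vectors stand in for features of one class.
rng = np.random.default_rng(0)
features = rng.normal(size=(40, 8))
p_c = class_prototype(features)
p_g = group_prototypes(features, num_groups=3)
z = rng.normal(size=8) + 3.0          # a sample that starts away from the data
z_refined = guided_step(z, p_c, p_g)  # lower energy than z under this toy energy
```

In this toy setting, each guided step lowers the energy, so a sample that starts far from the data moves toward the prototypes. How the energy is defined and how it is coupled to the sampler in DistDiff should be taken from the paper and the linked repository rather than from this sketch.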

Paper Structure

This paper contains 45 sections, 6 equations, 9 figures, 12 tables, 1 algorithm.

Figures (9)

  • Figure 1: Comparison between conventional data expansion methods and our distribution-aware diffusion framework, which benefits from hierarchical clustering and multi-step energy guidance.
  • Figure 2: Overview of the DistDiff pipeline. DistDiff enhances the generation process in diffusion models with distribution-aware optimization. It approximates the real data distribution using hierarchical prototypes ${\bm{p}}_c$ and ${\bm{p}}_g$, optimizing the sampling process through distribution-aware energy guidance. Subsequently, the originally generated data point ${\bm{z}}_t$ is refined for better alignment with the real distribution.
  • Figure 3: Our method outperforms state-of-the-art data expansion methods when trained on expanded datasets, underscoring the importance of a high-quality generator in training a classifier.
  • Figure 4: Performance comparison across different data scales. Our method significantly improves classification model performance in both low-data and large-scale data scenarios, outperforming the transformation-based method AutoAug and the synthesis-based method Stable Diffusion 1.4.
  • Figure 5: The visualization of synthetic samples generated by our method, showcasing high fidelity, diversity, and alignment with the original data distribution.
  • ...and 4 more figures