Structure-Guided Adversarial Training of Diffusion Models

Ling Yang, Haotian Qian, Zhilong Zhang, Jingwei Liu, Bin Cui

TL;DR

The paper advocates adversarial training of the diffusion generator against a novel structure discriminator in a minimax game, where the discriminator distinguishes real manifold structures from generated ones, so that the model captures authentic manifold structures in the data distribution.

Abstract

Diffusion models have demonstrated exceptional efficacy in various generative applications. While existing models focus on minimizing a weighted sum of denoising score matching losses for data distribution modeling, their training primarily emphasizes instance-level optimization, overlooking valuable structural information within each mini-batch, indicative of pair-wise relationships among samples. To address this limitation, we introduce Structure-guided Adversarial training of Diffusion Models (SADM). In this pioneering approach, we compel the model to learn manifold structures between samples in each training batch. To ensure the model captures authentic manifold structures in the data distribution, we advocate adversarial training of the diffusion generator against a novel structure discriminator in a minimax game, distinguishing real manifold structures from the generated ones. SADM substantially improves existing diffusion transformers (DiT) and outperforms existing methods in image generation and cross-domain fine-tuning tasks across 12 datasets, establishing a new state-of-the-art FID of 1.58 and 2.11 on ImageNet for class-conditional image generation at resolutions of 256x256 and 512x512, respectively.
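
The idea of comparing pair-wise relationships within a mini-batch can be illustrated with a small sketch. It is not the authors' implementation: the linear `encode` function, the Euclidean pairwise distance, and the mean-absolute-difference `structural_distance` are all assumptions chosen for illustration, since the abstract does not specify the discriminator architecture or the distance measure. In a minimax setup of this kind, the denoiser would be updated to reduce such a quantity and the discriminator's encoder to increase it.

```python
import math
import random


def pairwise_distances(features):
    """Return the n x n matrix of Euclidean distances between feature vectors."""
    return [[math.dist(a, b) for b in features] for a in features]


def structural_distance(real_feats, fake_feats):
    """Mean absolute difference between the two pairwise-distance matrices.

    Assumption: this particular measure is illustrative only.
    """
    assert len(real_feats) == len(fake_feats), "batches must have equal size"
    d_real = pairwise_distances(real_feats)
    d_fake = pairwise_distances(fake_feats)
    n = len(real_feats)
    total = sum(
        abs(d_real[i][j] - d_fake[i][j]) for i in range(n) for j in range(n)
    )
    return total / (n * n)


def encode(batch, weights):
    """Hypothetical linear encoder standing in for the structure discriminator."""
    return [
        [sum(w * x for w, x in zip(row, sample)) for row in weights]
        for sample in batch
    ]


random.seed(0)
# Toy "real" batch: 8 samples with 4 features each.
real = [[random.gauss(0, 1) for _ in range(4)] for _ in range(8)]
# Toy "generated" batch: the real samples perturbed by noise.
fake = [[x + random.gauss(0, 0.5) for x in sample] for sample in real]
# Random encoder weights mapping 4 features to a 3-dimensional space.
weights = [[random.gauss(0, 1) for _ in range(4)] for _ in range(3)]

loss = structural_distance(encode(real, weights), encode(fake, weights))
print(f"structural distance between real and generated batch: {loss:.4f}")
```

Because the measure depends only on distances between samples within each batch, it is zero whenever the generated batch reproduces the pair-wise structure of the real batch, which is the batch-level signal that instance-level denoising losses do not provide.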


Paper Structure

This paper contains 47 sections, 14 equations, 12 figures, 7 tables, 1 algorithm.

Figures (12)

  • Figure 1: Comparison between previous instance-level training and our structure-guided training for diffusion models.
  • Figure 2: Generated samples on ImageNet $256\times 256$ with (i) ADM with Classifier Guidance (ADM-G) (Dhariwal & Nichol, 2021), (ii) ADM optimized by our Structure-guided Adversarial Training, and (iii) real samples in ground truth classes. Our method significantly improves diffusion models both qualitatively and quantitatively, and the distribution of our generated samples is overall closer to the real sample distribution. See the appendix for more synthesis samples of our SOTA model.
  • Figure 3: Overview of SADM. We minimize the structural distance between the generated samples (fake) and ground truth samples (real) in the manifold space for optimizing the denoiser, and maximize their structural distance for optimizing the encoder in the structure discriminator. The denoiser and the structure discriminator are adversarially trained.
  • Figure 4: Qualitative comparison with ADM-G (Dhariwal & Nichol, 2021) and the previous SOTA method MDT (Gao et al., 2023). Our SADM can synthesize more realistic and high-quality samples while maintaining satisfactory diversity.
  • Figure 5: Heatmap visualization with 8 denoised samples.
  • ...and 7 more figures