Diffusion Guided Domain Adaptation of Image Generators

Kunpeng Song; Ligong Han; Bingchen Liu; Dimitris Metaxas; Ahmed Elgammal

TL;DR

This work addresses zero-shot domain adaptation for pre-trained image generators by leveraging Score Distillation Sampling (SDS) from large text-to-image diffusion models as a training objective. It introduces diffusion-guided domain adaptation for StyleGAN2 by distilling diffusion priors into the generator, without requiring target-domain ground-truth images, and tackles mode collapse with a diffusion directional regularizer and a reconstruction regularizer. The approach delivers strong qualitative and quantitative gains, including significantly improved FID and competitive CLIP scores, especially on long prompts, and extends to 3D-aware generators (EG3D) and DreamBooth guidance. The method offers a controllable, scalable path to align generators with diverse text-described domains, enabling rapid, high-quality cross-domain synthesis with practical impact for creative AI systems.

Abstract

Can a text-to-image diffusion model be used as a training objective for adapting a GAN generator to another domain? In this paper, we show that classifier-free guidance can be leveraged as a critic, enabling generators to distill knowledge from large-scale text-to-image diffusion models. Generators can be efficiently shifted into new domains indicated by text prompts without access to ground-truth samples from the target domains. We demonstrate the effectiveness and controllability of our method through extensive experiments. Although not trained to minimize CLIP loss, our model achieves equally high CLIP scores and significantly lower FID than prior work on short prompts, and outperforms the baseline qualitatively and quantitatively on long and complicated prompts. To the best of our knowledge, the proposed method is the first attempt at incorporating large-scale pre-trained diffusion models and distillation sampling for text-driven image generator domain adaptation, and it achieves a quality that was previously out of reach. Moreover, we extend our work to 3D-aware style-based generators and DreamBooth guidance.
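To make the idea of using classifier-free guidance as a critic more concrete, the following is a minimal NumPy sketch of a score-distillation-style gradient on a latent. It is not the authors' implementation. The function names, the weighting `1 - alpha_bar[t]`, the noise schedule, the timestep range, and the toy `predict_noise` denoiser are all assumptions made for illustration; in the actual method the gradient would continue through the encoder into the generator weights, which is omitted here.

```python
import numpy as np

def cfg_score(eps_uncond, eps_cond, guidance_scale):
    """Classifier-free guidance: extrapolate from the unconditional
    noise prediction toward the text-conditioned one."""
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)

def sds_latent_grad(z, t, alpha_bar, predict_noise, guidance_scale, rng):
    """Score-distillation-style gradient with respect to the latent z.

    predict_noise(z_t, t, conditioned) is a hypothetical stand-in for a
    frozen diffusion denoiser; it is not a real library call.
    """
    eps = rng.standard_normal(z.shape)
    a = alpha_bar[t]
    z_t = np.sqrt(a) * z + np.sqrt(1.0 - a) * eps  # forward diffusion step
    eps_hat = cfg_score(predict_noise(z_t, t, False),
                        predict_noise(z_t, t, True),
                        guidance_scale)
    weight = 1.0 - a  # assumed weighting w(t); other choices exist
    return weight * (eps_hat - eps)

def make_toy_denoiser(alpha_bar, target_cond, target_uncond):
    """Toy denoiser that is exact when all data sits at a single point."""
    def predict_noise(z_t, t, conditioned):
        target = target_cond if conditioned else target_uncond
        a = alpha_bar[t]
        return (z_t - np.sqrt(a) * target) / np.sqrt(1.0 - a)
    return predict_noise

# Toy usage: repeated updates should pull z toward the "prompt" target.
rng = np.random.default_rng(0)
alpha_bar = np.linspace(0.99, 0.01, 10)   # illustrative noise schedule
target = np.ones(4)
denoiser = make_toy_denoiser(alpha_bar, target, np.zeros(4))
z = np.zeros(4)
for _ in range(200):
    t = int(rng.integers(2, 8))           # arbitrary mid-range timesteps
    z -= 0.5 * sds_latent_grad(z, t, alpha_bar, denoiser, 1.0, rng)
```

In this toy setting the sampled noise cancels exactly, so the update reduces to a pull toward the conditioned target. With a real denoiser the cancellation is only approximate, and the residual acts as the training signal.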

Paper Structure (22 sections, 6 equations, 22 figures, 5 tables)

Introduction
Related Work
Methods
Background
Model Structure and Diffusion Guidance Loss
Directional and Reconstruction Regularizer
Timestep Range and Layer Selection
Experiments
Results
Comparison with Baseline
Long Text Prompts
Quantitative Evaluation
Timestep Range and Layer Selection
Directional and Reconstruction Regularizer
Extension to 3D-Aware Generative Models
...and 7 more sections

Figures (22)

Figure 1: Example images after adapting a generator to a domain specified by a text description. The first section adapts photos from the FFHQ dataset to 3D stylized anime; the second adapts cats to 3D-rendered cats. Detailed text prompts can be found in the appendix.
Figure 2: Overview of our StyleGAN-Fusion framework. The style-based generator $\mathcal{G}_\phi$ receives the gradient $\frac{\partial\mathcal{L}}{\partial \mathbf{x}}$ backpropagated from $\frac{\partial\mathcal{L}}{\partial \mathbf{z}}$ through encoder $\mathcal{E}$. $\hat{{\boldsymbol{\epsilon}}}_{\theta,\mathbf{c},\mathbf{z}_t}$ is the classifier-free guidance score. All noise and noisy images shown are decoded from the corresponding latents for visualization purposes.
Figure 3: Generated images from experiments on FFHQ face, AFHQ-Cat and Dog [DBLP:journals/corr/abs-1912-01865]. The text below each section is the driving prompt. Notice our model only takes in a target prompt and does not need the source prompt.
Figure 4: Generated images after adapting the StyleGAN2-Car [NEURIPS2019_9015] model to new domains indicated by the prompts.
Figure 5: Comparison of our method with the baseline on "FFHQ face to 3D-stylized face". The text prompt is long, requesting a Pixar rendering with a cinematic smooth texture, reflective eyes, and 3D lighting. Notice the eyes in the baseline are not as big, beautiful, and reflective as ours. The baseline also contains unexpected textures, conflicting with the prompt. Our results better match the prompt and have more realistic 3D lighting.
...and 17 more figures
