Uni-DAD: Unified Distillation and Adaptation of Diffusion Models for Few-step Few-shot Image Generation

Yara Bahram, Melodie Desbos, Mohammadhadi Shateri, Eric Granger

TL;DR

Uni-DAD proposes a single-stage framework that jointly distills and adapts diffusion models for fast, few-shot image generation in novel domains. By coupling a dual-domain distribution-matching objective with a multi-head GAN, and optionally leveraging a target teacher, it preserves source-domain diversity while sharpening target realism. Empirical results on few-shot image generation and subject-driven personalization show superior quality and diversity with as few as 3 denoising steps, offering a practical path to real-time personalized diffusion-based generation. The method is checkpoint-agnostic, enabling distillation of adapted models or adaptation of distilled ones without changing the training loop, and demonstrates strong potential for fast, high-fidelity, domain-shifted generation.

Abstract

Diffusion models (DMs) produce high-quality images, yet their sampling remains costly when adapted to new domains. Distilled DMs are faster but typically remain confined within their teacher's domain. Thus, fast and high-quality generation for novel domains relies on two-stage training pipelines: Adapt-then-Distill or Distill-then-Adapt. However, both add design complexity and suffer from degraded quality or diversity. We introduce Uni-DAD, a single-stage pipeline that unifies distillation and adaptation of DMs. It couples two signals during training: (i) a dual-domain distribution-matching distillation objective that guides the student toward the distributions of the source teacher and a target teacher, and (ii) a multi-head generative adversarial network (GAN) loss that encourages target realism across multiple feature scales. The source-domain distillation preserves diverse source knowledge, while the multi-head GAN stabilizes training and reduces overfitting, especially in few-shot regimes. The inclusion of a target teacher facilitates adaptation to more structurally distant domains. We evaluate on a variety of datasets for few-shot image generation (FSIG) and subject-driven personalization (SDP). Uni-DAD delivers higher quality than state-of-the-art (SoTA) adaptation methods even with fewer than 4 sampling steps, and outperforms two-stage training pipelines in both quality and diversity.
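
As a rough illustration of the two training signals named in the abstract, the sketch below blends a source-teacher and a target-teacher score estimate into one distribution-matching direction and averages an adversarial generator loss over several discriminator heads. Everything here is an assumption made for illustration, not the paper's formulation: the function names, the convex blend in which `a` weights the target teacher, and the non-saturating softplus loss are all hypothetical choices.

```python
import math


def dual_domain_direction(fake_score, src_score, trg_score, a):
    """Distribution-matching direction against a blend of two teachers.

    ASSUMPTION: the two teachers are combined as a convex blend in which
    `a` weights the target teacher and (1 - a) the source teacher.
    The direction is (fake score) minus (blended teacher score).
    """
    if not 0.0 <= a <= 1.0:
        raise ValueError("a must lie in [0, 1]")
    return [
        f - ((1.0 - a) * s + a * t)
        for f, s, t in zip(fake_score, src_score, trg_score)
    ]


def softplus(x):
    # Numerically stable log(1 + exp(x)).
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


def multi_head_generator_loss(head_logits):
    """Non-saturating generator loss averaged over discriminator heads.

    ASSUMPTION: each head emits one logit for a generated sample, and the
    heads are weighted equally.
    """
    return sum(softplus(-z) for z in head_logits) / len(head_logits)
```

With `a = 0` the direction uses only the source teacher, and with `a = 1` only the target teacher; intermediate values trade source diversity against target fit under this assumed blend.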

Paper Structure

This paper contains 28 sections, 10 equations, 10 figures, 7 tables, 1 algorithm.

Figures (10)

  • Figure 1: Uni-DAD (Distill & Adapt) vs. two-stage pipelines, Distill-then-Adapt and Adapt-then-Distill. Adapt is performed by fine-tuning, and Distill by DMD2 (Yin et al., 2024). The source domain is represented by 70K diverse faces, and the target domain by 10 babies. Sampling steps are reduced from 25 to 3.
  • Figure 2: Overview of Uni-DAD for few-step and few-shot image generation. A (frozen) source teacher $\epsilon^{\text{src}}$ is adapted and distilled into a student $G$ for fast sampling ($1\leq\text{NFEs}\leq4$) on the target domain. At each training iteration, Uni-DAD alternates among three updates: (1) Student: optimize $G$ with a dual-domain DMD objective on $\epsilon^{\text{src}}$ and target teacher $\epsilon^{\text{trg}}$, plus a GAN generator loss; (2) Fake teacher and discriminator: train a fake teacher $\epsilon^{\text{fk}}$ on student generations and train a multi-head discriminator $D$ to distinguish target images from student generations; (3) Target teacher update: train $\epsilon^{\text{trg}}$ on target images.
  • Figure 3: Sensitivity analysis of sample quality to NFE.
  • Figure 4: Qualitative ablation of the dual-domain DMD weighting factor $a$ on Babies and MetFaces.
  • Figure 5: Qualitative comparison for 10-shot adaptation from a guided DM (Dhariwal and Nichol, 2021) pretrained on FFHQ (Karras et al., 2019) to target domains of varying proximity to the source. Generated samples are randomly picked. Zoom in for details.
  • ...and 5 more figures
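
The alternation described in the Figure 2 caption can be sketched as a plain control-flow skeleton. This is a minimal sketch under stated assumptions, not the authors' training code: `run_alternating_updates` and the step names are hypothetical, each step is an opaque callable standing in for one optimizer update, and every step runs exactly once per iteration (any update-frequency ratio between steps is not modeled). The frozen source teacher has no step because it is never updated.

```python
from typing import Callable, Dict, List

Step = Callable[[], None]

# Order follows the Figure 2 caption: (1) student, (2) fake teacher and
# discriminator, (3) target teacher. The step names are hypothetical.
UPDATE_ORDER = ["student", "fake_teacher", "discriminator", "target_teacher"]


def run_alternating_updates(steps: Dict[str, Step], num_iters: int) -> List[str]:
    """Call each update step once per iteration and return the call trace."""
    missing = [name for name in UPDATE_ORDER if name not in steps]
    if missing:
        raise KeyError(f"missing update steps: {missing}")
    trace: List[str] = []
    for _ in range(num_iters):
        for name in UPDATE_ORDER:
            steps[name]()  # one optimizer update for this component
            trace.append(name)
    return trace
```

In a real setup each callable would wrap its own loss computation and optimizer step; keeping them as separate callables makes the alternation explicit and lets each component own its optimizer.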