Image is All You Need to Empower Large-scale Diffusion Models for In-Domain Generation

Pu Cao, Feng Zhou, Lu Yang, Tianrui Huang, Qing Song

TL;DR

The paper tackles in-domain generation by adapting large diffusion priors to domain data using only image data, addressing the fidelity-controllability tradeoff that emerges during fine-tuning. It introduces a guidance-decoupled prior preservation framework that splits conditional guidance into domain guidance (learned from a domain-specific diffusion model) and control guidance (preserved by diffusion priors), along with an efficient domain-knowledge learner based on a null-text diffusion model. A multi-guidance synthesis pipeline combines domain, unconditional, and control priors with carefully initialized weights, enabling high-fidelity, controllable generation across faces, animals, and porcelain, and extending to editing and text-to-3D tasks while remaining compatible with existing control methods. The findings show improved domain alignment (FID), preserved controllability (text/spatial controls), and broad applicability to diverse tasks, suggesting scalable, label-free domain adaptation of diffusion models for in-domain generation. Overall, the work advances practical diffusion-model personalization by separating domain-specific learning from open-world guidance and leveraging priors to rectify drift during domain adaptation.

Abstract

In-domain generation aims to perform a variety of tasks within a specific domain, such as unconditional generation, text-to-image, image editing, 3D generation, and more. Early research typically required training specialized generators for each unique task and domain, often relying on fully-labeled data. Motivated by the powerful generative capabilities and broad applications of diffusion models, we explore leveraging label-free data to empower these models for in-domain generation. Fine-tuning a pre-trained generative model on domain data is an intuitive but challenging approach that often requires complex manual hyper-parameter tuning, since the limited diversity of the training data can easily disrupt the model's original generative capabilities. To address this challenge, we propose a guidance-decoupled prior preservation mechanism that achieves high generative quality and controllability using image-only data, motivated by the idea of preserving the pre-trained model from a denoising-guidance perspective. We decouple domain-related guidance from the conditional guidance used in classifier-free guidance, which preserves the open-world control guidance and unconditional guidance of the pre-trained model. We further propose an efficient domain knowledge learning technique that trains an additional text-free UNet copy to predict domain guidance. In addition, we theoretically illustrate a multi-guidance in-domain generation pipeline for a variety of generative tasks, leveraging multiple guidances from distinct diffusion models and conditions. Extensive experiments demonstrate the superiority of our method in domain-specific synthesis and its compatibility with various diffusion-based control methods and applications.
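
The guidance decoupling described above can be illustrated with a small numerical sketch. This is not the paper's exact formulation, which is not reproduced in this summary. It assumes the combined noise estimate is the unconditional prediction plus two weighted guidance terms, in the style of standard classifier-free guidance. The function names, the weights `w_control` and `w_domain`, and the random arrays standing in for UNet outputs are all hypothetical.

```python
import numpy as np


def classifier_free_guidance(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: move from the unconditional
    prediction toward the conditional one by a factor of `scale`."""
    return eps_uncond + scale * (eps_cond - eps_uncond)


def multi_guidance(eps_uncond, eps_control, eps_domain, w_control, w_domain):
    """Assumed multi-guidance combination (illustrative only).

    eps_uncond  : unconditional prediction of the ORIGINAL pre-trained model
    eps_control : text/control-conditioned prediction of the ORIGINAL model
    eps_domain  : prediction of a separate text-free domain model
    """
    control_guidance = eps_control - eps_uncond  # preserved open-world control
    domain_guidance = eps_domain - eps_uncond    # learned from domain images
    return eps_uncond + w_control * control_guidance + w_domain * domain_guidance


# Random arrays stand in for real UNet outputs on a latent of this shape.
rng = np.random.default_rng(0)
shape = (4, 8, 8)
eps_uncond = rng.standard_normal(shape)
eps_control = rng.standard_normal(shape)
eps_domain = rng.standard_normal(shape)

# The weight values are arbitrary illustration values, not the paper's settings.
combined = multi_guidance(eps_uncond, eps_control, eps_domain,
                          w_control=7.5, w_domain=1.0)
baseline = classifier_free_guidance(eps_uncond, eps_control, scale=7.5)
no_domain = multi_guidance(eps_uncond, eps_control, eps_domain,
                           w_control=7.5, w_domain=0.0)
```

Under this assumed form, setting `w_domain` to zero recovers ordinary classifier-free guidance from the original model. That matches the abstract's intent: control and unconditional guidance come from the untouched pre-trained model, and domain knowledge enters only through the extra term.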

Paper Structure

This paper contains 27 sections, 8 equations, 19 figures, and 2 tables.

Figures (19)

  • Figure 1: Illustration of In-domain Generation. In this work, we empower large-scale pre-trained diffusion models using only image data to perform varied generation tasks within each domain with high fidelity and controllability. We mark the input data in blue and the results generated with the original Stable Diffusion v1.5 model in orange.
  • Figure 2: Challenge of fine-tuning diffusion models on domain data. We show the fine-tuning process of Stable Diffusion v1.5 on a facial image dataset (i.e., FFHQ; Karras et al., 2019). The open-world controllability of the pre-trained diffusion model gradually degrades during fine-tuning, while domain fidelity improves.
  • Figure 3: Illustration of the Fine-tuning Process. We demonstrate the catastrophic forgetting of guidance during the fine-tuning process.
  • Figure 4: Unconditional Guidance Drift. The unconditional generation results of fine-tuned diffusion models reflect the visual patterns of the training datasets, which can cause inaccurate noise estimation.
  • Figure 5: Conditional Guidance Decoupling. We compare the guidance estimation between previous customization methods and ours. We decouple the conditional guidance into domain guidance and control guidance while predicting the control guidance and unconditional guidance using the original diffusion model to keep them unchanged.
  • ...and 14 more figures

Theorems & Definitions (1)

  • proof