Image is All You Need to Empower Large-scale Diffusion Models for In-Domain Generation

Pu Cao; Feng Zhou; Lu Yang; Tianrui Huang; Qing Song

TL;DR

The paper tackles in-domain generation by adapting large diffusion priors to domain data using only image data, addressing the fidelity-controllability tradeoff that emerges during fine-tuning. It introduces a guidance-decoupled prior preservation framework that splits conditional guidance into domain guidance (learned from a domain-specific diffusion model) and control guidance (preserved by diffusion priors), along with an efficient domain-knowledge learner based on a null-text diffusion model. A multi-guidance synthesis pipeline combines domain, unconditional, and control priors with carefully initialized weights, enabling high-fidelity, controllable generation across faces, animals, and porcelain, and extending to editing and text-to-3D tasks while remaining compatible with existing control methods. The findings show improved domain alignment (FID), preserved controllability (text/spatial controls), and broad applicability to diverse tasks, suggesting scalable, label-free domain adaptation of diffusion models for in-domain generation. Overall, the work advances practical diffusion-model personalization by separating domain-specific learning from open-world guidance and leveraging priors to rectify drift during domain adaptation.

Abstract

In-domain generation aims to perform a variety of tasks within a specific domain, such as unconditional generation, text-to-image, image editing, 3D generation, and more. Early research typically required training specialized generators for each unique task and domain, often relying on fully-labeled data. Motivated by the powerful generative capabilities and broad applications of diffusion models, we explore leveraging label-free data to empower these models for in-domain generation. Fine-tuning a pre-trained generative model on domain data is an intuitive but challenging approach that often requires complex manual hyper-parameter adjustments, since the limited diversity of the training data can easily disrupt the model's original generative capabilities. To address this challenge, we propose a guidance-decoupled prior preservation mechanism that achieves high generative quality and controllability using image-only data, inspired by preserving the pre-trained model from a denoising guidance perspective. We decouple domain-related guidance from the conditional guidance used in classifier-free guidance mechanisms to preserve open-world control guidance and unconditional guidance from the pre-trained model. We further propose an efficient domain knowledge learning technique that trains an additional text-free UNet copy to predict domain guidance. In addition, we theoretically illustrate a multi-guidance in-domain generation pipeline for a variety of generative tasks, leveraging multiple guidances from distinct diffusion models and conditions. Extensive experiments demonstrate the superiority of our method in domain-specific synthesis and its compatibility with various diffusion-based control methods and applications.
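The guidance decoupling described above can be pictured with a small sketch. This is only an illustration under stated assumptions, not the paper's exact formulation: the function name `multi_guidance`, the weight names `w_domain` and `w_control`, their default values, and the additive form (an extension of standard classifier-free guidance) are all hypothetical. It assumes the unconditional and control predictions come from the frozen pre-trained model, and the domain prediction comes from the text-free UNet copy.

```python
def multi_guidance(eps_uncond, eps_domain, eps_control,
                   w_domain=1.0, w_control=7.5):
    """Combine three noise predictions into one guided estimate.

    Assumed (hypothetical) inputs:
      eps_uncond  - unconditional prediction from the frozen pre-trained model
      eps_domain  - prediction from the text-free, domain-trained UNet copy
      eps_control - conditional (e.g. text-prompted) prediction from the
                    frozen pre-trained model
    Inputs may be plain floats or arrays that support elementwise arithmetic.
    """
    # Direction that pulls the sample toward the target domain.
    domain_direction = eps_domain - eps_uncond
    # Direction that pulls the sample toward the control condition.
    control_direction = eps_control - eps_uncond
    # Additive combination, in the style of classifier-free guidance.
    return eps_uncond + w_domain * domain_direction + w_control * control_direction
```

With both weights set to zero the sketch falls back to the unconditional prediction, and with `w_domain=1.0, w_control=0.0` it returns the domain model's prediction unchanged, so the two weights trade off domain fidelity against controllability.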

Paper Structure (27 sections, 8 equations, 19 figures, 2 tables)

Introduction
Related Works
Method
Guidance Catastrophic Forgetting
Guidance-Decoupled Prior Preservation
Efficient Domain Knowledge Learning
In-domain Generation with Multi-Guidance
Experiments
Experimental Settings
Results of In-domain Image Generation
Results of Other In-domain Generation Tasks
Comparisons to Fine-tuning Techniques
Effects of Unconditional Guidance Rectification
Conclusion
Acknowledgements
...and 12 more sections

Figures (19)

Figure 1: Illustration of In-domain Generation. In this work, we empower large-scale pre-trained diffusion models using only image data to perform varied generation tasks within each domain with high fidelity and controllability. We mark the input data in blue and the results generated with the original Stable Diffusion v1.5 model in orange.
Figure 2: Challenge of fine-tuning diffusion models on domain data. We show the fine-tuning process of Stable Diffusion v1.5 on a facial image dataset (i.e., FFHQ; Karras et al., 2019). The open-world controllability of the pre-trained diffusion model gradually degrades during fine-tuning, while domain fidelity improves.
Figure 3: Illustration of the Fine-tuning Process. We demonstrate guidance catastrophic forgetting during the fine-tuning process.
Figure 4: Unconditional Guidance Drift. The unconditional generation results of fine-tuned diffusion models reflect the visual pattern of training datasets, which would cause inaccurate noise estimation.
Figure 5: Conditional Guidance Decoupling. We compare the guidance estimation between previous customization methods and ours. We decouple the conditional guidance into domain guidance and control guidance while predicting the control guidance and unconditional guidance using the original diffusion model to keep them unchanged.
...and 14 more figures

Theorems & Definitions (1)

proof
