Image is All You Need to Empower Large-scale Diffusion Models for In-Domain Generation
Pu Cao, Feng Zhou, Lu Yang, Tianrui Huang, Qing Song
TL;DR
The paper tackles in-domain generation by adapting large diffusion priors to domain data using only image data, addressing the fidelity-controllability tradeoff that emerges during fine-tuning. It introduces a guidance-decoupled prior preservation framework that splits conditional guidance into domain guidance (learned from a domain-specific diffusion model) and control guidance (preserved by diffusion priors), along with an efficient domain-knowledge learner based on a null-text diffusion model. A multi-guidance synthesis pipeline combines domain, unconditional, and control priors with carefully initialized weights, enabling high-fidelity, controllable generation across faces, animals, and porcelain, and extending to editing and text-to-3D tasks while remaining compatible with existing control methods. The findings show improved domain alignment (FID), preserved controllability (text/spatial controls), and broad applicability to diverse tasks, suggesting scalable, label-free domain adaptation of diffusion models for in-domain generation. Overall, the work advances practical diffusion-model personalization by separating domain-specific learning from open-world guidance and leveraging priors to rectify drift during domain adaptation.
Abstract
In-domain generation aims to perform a variety of tasks within a specific domain, such as unconditional generation, text-to-image, image editing, 3D generation, and more. Early research typically required training specialized generators for each unique task and domain, often relying on fully-labeled data. Motivated by the powerful generative capabilities and broad applications of diffusion models, we explore leveraging label-free data to empower these models for in-domain generation. Fine-tuning a pre-trained generative model on domain data is an intuitive but challenging approach that often requires complex manual hyper-parameter tuning, since the limited diversity of the training data can easily disrupt the model's original generative capabilities. To address this challenge, we propose a guidance-decoupled prior preservation mechanism that achieves high generative quality and controllability using image-only data, motivated by the idea of preserving the pre-trained model from a denoising-guidance perspective. We decouple domain-related guidance from the conditional guidance used in classifier-free guidance mechanisms to preserve open-world control guidance and unconditional guidance from the pre-trained model. We further propose an efficient domain knowledge learning technique that trains an additional text-free UNet copy to predict domain guidance. In addition, we theoretically illustrate a multi-guidance in-domain generation pipeline for a variety of generative tasks, leveraging multiple guidances from distinct diffusion models and conditions. Extensive experiments demonstrate the superiority of our method in domain-specific synthesis and its compatibility with various diffusion-based control methods and applications.
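To make the guidance decoupling concrete, the following is a minimal sketch of how several noise predictions might be combined at one denoising step. It is not the paper's exact formulation: the function name `combine_guidance`, the additive form, and the default weights are assumptions chosen to mirror standard classifier-free guidance, where the final prediction is the unconditional prediction plus weighted guidance directions. Here `eps_uncond` and `eps_control` are assumed to come from the frozen pre-trained model (without and with the control condition), and `eps_domain` from the text-free domain UNet copy.

```python
from typing import List, Sequence


def combine_guidance(
    eps_uncond: Sequence[float],
    eps_domain: Sequence[float],
    eps_control: Sequence[float],
    w_domain: float = 1.0,   # hypothetical default, not taken from the paper
    w_control: float = 7.5,  # hypothetical default, not taken from the paper
) -> List[float]:
    """Combine three noise predictions for one denoising step.

    Assumed form (a sketch, not the authors' equation):
        eps = eps_uncond
              + w_domain  * (eps_domain  - eps_uncond)
              + w_control * (eps_control - eps_uncond)
    Inputs are flattened noise predictions of equal length.
    """
    if not (len(eps_uncond) == len(eps_domain) == len(eps_control)):
        raise ValueError("all noise predictions must have the same length")
    return [
        u + w_domain * (d - u) + w_control * (c - u)
        for u, d, c in zip(eps_uncond, eps_domain, eps_control)
    ]


if __name__ == "__main__":
    # Toy two-element "noise predictions" purely for illustration.
    uncond = [0.0, 1.0]
    domain = [1.0, 1.0]
    control = [0.0, 3.0]
    print(combine_guidance(uncond, domain, control, w_domain=1.0, w_control=2.0))
```

Under this assumed form, setting `w_domain` to zero recovers ordinary classifier-free guidance on the pre-trained model, and setting both weights to zero returns the unconditional prediction unchanged, which is one way to see that the domain term is an add-on that leaves the pre-trained control and unconditional guidance intact.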
