Diffusion Self-Distillation for Zero-Shot Customized Image Generation
Shengqu Cai, Eric Chan, Yunzhi Zhang, Leonidas Guibas, Jiajun Wu, Gordon Wetzstein
TL;DR
This work tackles fine-grained, identity-preserving control in text-to-image diffusion without test-time optimization. It introduces Diffusion Self-Distillation, which automatically generates a large, identity-consistent paired dataset using a teacher model, prompts from LLMs, and VLM-based curation, then fine-tunes the model into an image-conditioned, two-frame generator. The approach yields state-of-the-art zero-shot customization across diverse identities and contexts, matching or approaching per-instance tuning methods while avoiding their per-instance optimization cost. Empirical results, including GPT-based evaluations and a user study, show strong identity preservation, prompt fidelity, and creative diversity, with potential applications in digital art, comics, and character design.
Abstract
Text-to-image diffusion models produce impressive results but are frustrating tools for artists who desire fine-grained control. For example, a common use case is to create images of a specific instance in novel contexts, i.e., "identity-preserving generation". This setting, along with many other tasks (e.g., relighting), is a natural fit for image+text-conditional generative models. However, there is insufficient high-quality paired data to train such a model directly. We propose Diffusion Self-Distillation, a method for using a pre-trained text-to-image model to generate its own dataset for text-conditioned image-to-image tasks. We first leverage a text-to-image diffusion model's in-context generation ability to create grids of images and curate a large paired dataset with the help of a Visual-Language Model. We then fine-tune the text-to-image model into a text+image-to-image model using the curated paired dataset. We demonstrate that Diffusion Self-Distillation outperforms existing zero-shot methods and is competitive with per-instance tuning techniques on a wide range of identity-preserving generation tasks, without requiring test-time optimization.
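To make the data-curation step concrete, here is a minimal sketch of turning one generated grid into candidate training pairs. It is an illustration under stated assumptions, not the authors' implementation: the grid layout, the toy `Image` type, and the names `split_grid` and `curate_pairs` are hypothetical, and the `same_identity` callable is a stand-in for whatever Visual-Language Model judgment is actually used.

```python
from itertools import permutations
from typing import Callable, List, Tuple

# Toy stand-in for an image: a 2D list of pixel values (assumption for illustration).
Image = List[List[int]]


def split_grid(grid: Image, rows: int, cols: int) -> List[Image]:
    """Cut a grid image into rows*cols equally sized panels, in row-major order."""
    height, width = len(grid), len(grid[0])
    if height % rows or width % cols:
        raise ValueError("grid size must be divisible by the panel layout")
    panel_h, panel_w = height // rows, width // cols
    panels = []
    for r in range(rows):
        for c in range(cols):
            band = grid[r * panel_h:(r + 1) * panel_h]
            panels.append([row[c * panel_w:(c + 1) * panel_w] for row in band])
    return panels


def curate_pairs(
    panels: List[Image],
    same_identity: Callable[[Image, Image], bool],
) -> List[Tuple[Image, Image]]:
    """Keep the ordered (reference, target) pairs that the judge accepts.

    `same_identity` is a hypothetical placeholder for a VLM-based check.
    """
    return [(a, b) for a, b in permutations(panels, 2) if same_identity(a, b)]
```

In a pipeline of this shape, each accepted pair would supply a reference image and a target image for fine-tuning the text+image-to-image model; how the actual method prompts for grids and queries the VLM is not specified here.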
