Diffusion Self-Distillation for Zero-Shot Customized Image Generation

Shengqu Cai, Eric Chan, Yunzhi Zhang, Leonidas Guibas, Jiajun Wu, Gordon Wetzstein

TL;DR

This work tackles the challenge of fine-grained identity-preserving control in text-to-image diffusion without test-time optimization. It introduces Diffusion Self-Distillation, which automatically generates a large, identity-consistent paired dataset using a teacher model, prompts from LLMs, and VLM-based curation, then fine-tunes the model into an image-conditioned, two-frame generator. The approach yields state-of-the-art zero-shot customization across diverse identities and contexts, matching or approaching per-instance tuning while preserving efficiency. Empirical results, including GPT-based evaluations and a user study, demonstrate strong identity preservation, prompt fidelity, and creative diversity, with potential broad impact on digital art, comics, and character design.

Abstract

Text-to-image diffusion models produce impressive results but are frustrating tools for artists who desire fine-grained control. For example, a common use case is to create images of a specific instance in novel contexts, i.e., "identity-preserving generation". This setting, along with many other tasks (e.g., relighting), is a natural fit for image+text-conditional generative models. However, there is insufficient high-quality paired data to train such a model directly. We propose Diffusion Self-Distillation, a method for using a pre-trained text-to-image model to generate its own dataset for text-conditioned image-to-image tasks. We first leverage a text-to-image diffusion model's in-context generation ability to create grids of images and curate a large paired dataset with the help of a Visual-Language Model. We then fine-tune the text-to-image model into a text+image-to-image model using the curated paired dataset. We demonstrate that Diffusion Self-Distillation outperforms existing zero-shot methods and is competitive with per-instance tuning techniques on a wide range of identity-preservation generation tasks, without requiring test-time optimization.

Paper Structure

This paper contains 35 sections, 20 figures, 2 tables.

Figures (20)

  • Figure 1: Given an input image, Diffusion Self-Distillation is a novel diffusion-based approach that generates diverse images that maintain the input's identity across various contexts. Unlike prior approaches that require fine-tuning or are limited to specific domains, Diffusion Self-Distillation offers instant customization without any additional inference-stage training, enabling precise control and editability in text-to-image diffusion models. This ability makes Diffusion Self-Distillation a valuable tool for general AI content creation.
  • Figure 2: Overview of our pipeline. Left: the top shows our vanilla paired data generation wheel (Sec. \ref{sec:data_generation}). We first sample reference image captions from the LAION [schuhmann2022laion] dataset. These reference captions are parsed by an LLM and translated into identity-preserved grid generation prompts (Sec. \ref{sec:prompt_generation}). We feed these enhanced prompts to a pretrained text-to-image diffusion model to sample potentially identity-preserved grids of images, which are then cropped and composed into vanilla image pairs (Sec. \ref{sec:vanilla_data_generation}). On the bottom, we show our data curation pipeline (Sec. \ref{sec:data_curation}), where the vanilla image pairs are fed into a VLM that classifies whether they depict identical main subjects. This process mimics a human annotation/curation process while being fully automatic; we use the curated data as our final training data. Right: we extend the diffusion transformer model into an image-conditioned framework by treating the input image as the first frame of a two-frame sequence. The model generates both frames simultaneously: the first reconstructs the input, while the second is the edited output. This allows effective information exchange between the conditioning image and the desired output.
  • Figure 3: Difference between structure-preserving and identity-preserving edits. In structure-preserving editing, the main structures of the image are preserved, and only local edits or stylizations are performed. In identity-preserving editing, the global structure of the image may change radically.
  • Figure 4: Qualitative comparison. Overall, our method achieves high subject identity preservation and prompt-aligned diversity without the "copy-paste" effect seen in the results of IP-Adapter+ [ye2023ipadapter]. This is largely thanks to our supervised training pipeline, which leverages the base model's in-context generation ability.
  • Figure 5: Qualitative result. Diffusion Self-Distillation handles various customization targets across different tasks and styles, for instance characters or objects, photorealistic or animated. It can also take instruction-style prompts as input, similar to InstructPix2Pix [brooks2022instructpix2pix]. Further, our model exhibits relighting capabilities without significantly altering the scene's content.
  • ...and 15 more figures
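
The data-generation steps described in the Figure 2 caption (crop a generated grid into panels, compose the panels into pairs, keep only the pairs a VLM judges to show the same subject) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: images are toy nested lists of pixel values, the names `crop_grid`, `make_pairs`, and `curate` are hypothetical, forming every unordered pair of panels is an assumption, and `same_subject` is a placeholder for the VLM classifier.

```python
from itertools import combinations
from typing import Callable, List, Tuple

# Toy image type (assumption): a list of pixel rows, each a list of ints.
Image = List[List[int]]


def crop_grid(grid: Image, rows: int, cols: int) -> List[Image]:
    """Split a grid image into rows * cols equal panels, in row-major order."""
    height, width = len(grid), len(grid[0])
    if height % rows or width % cols:
        raise ValueError("grid size must be divisible by the panel layout")
    panel_h, panel_w = height // rows, width // cols
    panels = []
    for r in range(rows):
        for c in range(cols):
            band = grid[r * panel_h:(r + 1) * panel_h]
            panels.append([row[c * panel_w:(c + 1) * panel_w] for row in band])
    return panels


def make_pairs(panels: List[Image]) -> List[Tuple[Image, Image]]:
    """Compose panels into candidate pairs (assumption: all unordered pairs)."""
    return list(combinations(panels, 2))


def curate(
    pairs: List[Tuple[Image, Image]],
    same_subject: Callable[[Image, Image], bool],
) -> List[Tuple[Image, Image]]:
    """Keep pairs accepted by the classifier; `same_subject` stands in for the VLM."""
    return [pair for pair in pairs if same_subject(*pair)]
```

In a real pipeline the panels would be image arrays and `same_subject` would wrap a call to a vision-language model; the stand-in predicate only keeps the sketch self-contained.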