An Improved Method for Personalizing Diffusion Models

Yan Zeng, Masanori Suganuma, Takayuki Okatani

TL;DR

This paper tackles the challenge of personalizing diffusion-based image generation to a specific object using only a few examples. It proposes a two-stage method that first learns a rare-adjective token embedding and then fine-tunes the diffusion model without prior preservation, freezing the text encoder and the embedding during stage two. Compared with Textual Inversion and Dreambooth, the approach yields higher fidelity with substantially less training time and mitigates language drift and forgetting. Quantitative results using CLIP and DINO scores, along with qualitative assessments, show consistent improvements over baseline personalization methods, suggesting a practical path to efficient, robust subject-driven generation.
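The two-stage schedule described above can be sketched as a small configuration table that records which components are trained in each stage. This is only an illustrative sketch: the names `StagePlan` and `stage_plan` are hypothetical, and the stage-one setting (only the new token embedding is optimized, as in textual inversion) is an assumption, since the summary states explicitly only what is frozen in stage two.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class StagePlan:
    """Which components are updated in a training stage (hypothetical helper)."""
    train_token_embedding: bool
    train_text_encoder: bool
    train_diffusion_model: bool
    use_prior_preservation: bool


def stage_plan(stage: int) -> StagePlan:
    """Return the training plan for stage 1 or stage 2."""
    if stage == 1:
        # Stage 1: learn the embedding of the rare-adjective token.
        # Assumption: everything else stays frozen, as in textual inversion.
        return StagePlan(
            train_token_embedding=True,
            train_text_encoder=False,
            train_diffusion_model=False,
            use_prior_preservation=False,
        )
    if stage == 2:
        # Stage 2: fine-tune the diffusion model without prior preservation,
        # with the text encoder and the learned embedding frozen.
        return StagePlan(
            train_token_embedding=False,
            train_text_encoder=False,
            train_diffusion_model=True,
            use_prior_preservation=False,
        )
    raise ValueError(f"unknown stage: {stage}")


if __name__ == "__main__":
    for s in (1, 2):
        print(s, stage_plan(s))
```

In a real training script, a plan like this would typically decide which parameter groups are passed to the optimizer in each stage.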

Abstract

Diffusion models have demonstrated impressive image generation capabilities. Personalization approaches, such as textual inversion and Dreambooth, adapt a model to a specific object using a few images of it. These methods enable generating images of that object in diverse textual contexts. Our proposed approach aims to retain the model's original knowledge while integrating new information, yielding better results while requiring less training time than Dreambooth and textual inversion.

Paper Structure (17 sections, 4 equations, 5 figures, 1 table)

Figures (5)

  • Figure 1: Illustration of textual inversion gal2022textual. Given a few sample images of a specific target object, it optimizes the embedding $v_*$ of a newly introduced word/token $S_*$ for the object, enabling the personalization of a pre-trained diffusion model. The embedding $v_*$ is initialized using the embedding of a general class (e.g. "dog") of the target object.
  • Figure 2: Illustration of Dreambooth ruiz2023dreambooth. Given a few sample images of a target object, it fine-tunes a diffusion model by minimizing the sum of two loss functions. The reconstruction loss measures the difference between the generated images and the sample images of the target object. The prior preservation loss measures the corresponding difference for images of 'common class' objects, and aims to mitigate forgetting of the ability to generate such images.
  • Figure 3: Results of personalization with Dreambooth and our method. Reconstruction of 'backpack', 'dog', and 'cat' subject instances from the dataset in ruiz2023dreambooth. The results from the best-performing checkpoints are selected and shown here. The images enclosed by blue boxes are the results of our method, with the prompts used from left to right being "a photo of $\langle rare \rangle$ backpack", "a photo of $\langle rare \rangle$ backpack on the beach", and "a photo of $\langle rare \rangle$ backpack in the jungle". The red bounding boxes show Dreambooth's results, with the prompts used from left to right being "a photo of sks backpack", "a photo of sks backpack on the beach", and "a photo of sks backpack in the jungle". As mentioned in Section 3.1.2, we adopt the standard setting of huggingface that uses 'sks' as the identifier.
  • Figure 4: Blurriness and diminished realism. One of the input sample images is shown on the left and the generated results are on the right. The blue boxes indicate the results of our method and the red boxes indicate those of Dreambooth. The prompt is "a photo of cat". All results are taken from the checkpoints at 200 steps.
  • Figure 5: Degradation of image quality due to extended training steps. Three examples for the object 'plushie' (stuffed animal). The blue boxes indicate the results of our method using the prompt "a photo of $\langle rare \rangle$ plushie". The red boxes indicate those of Dreambooth using the prompt "a photo of sks plushie". From left to right, the images correspond to checkpoints at training steps 200, 400, 600, and 800.
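
The two-term objective described in the Figure 2 caption can be illustrated with a minimal sketch. It is a simplification under stated assumptions, not the Dreambooth implementation: images are stood in for by flat lists of numbers, the "difference" is taken to be mean squared error, and the names `mse`, `dreambooth_style_loss`, and `prior_weight` are hypothetical. Setting `prior_weight` to 0 drops the prior preservation term, which corresponds to the setting the TL;DR describes for the proposed method's second stage.

```python
from typing import Sequence


def mse(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean squared error between two equal-length sequences."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    if not pred:
        raise ValueError("inputs must be non-empty")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def dreambooth_style_loss(
    pred_object: Sequence[float],
    true_object: Sequence[float],
    pred_class: Sequence[float],
    true_class: Sequence[float],
    prior_weight: float = 1.0,  # hypothetical weighting, not taken from the source
) -> float:
    """Sum of a reconstruction term and a weighted prior preservation term."""
    # Reconstruction term: target-object samples versus model outputs.
    reconstruction = mse(pred_object, true_object)
    # Prior preservation term: the same comparison for 'common class' samples.
    prior = mse(pred_class, true_class)
    return reconstruction + prior_weight * prior


if __name__ == "__main__":
    both_terms = dreambooth_style_loss([1.0, 2.0], [1.0, 4.0], [0.0], [1.0])
    no_prior = dreambooth_style_loss(
        [1.0, 2.0], [1.0, 4.0], [0.0], [1.0], prior_weight=0.0
    )
    print(both_terms, no_prior)
```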