An Improved Method for Personalizing Diffusion Models
Yan Zeng, Masanori Suganuma, Takayuki Okatani
TL;DR
This paper tackles the challenge of personalizing diffusion-based image generation to a specific object using only a few examples. It proposes a two-stage method that first learns a rare-adjective token embedding and then fine-tunes the diffusion model without prior preservation, freezing the text encoder and the embedding during stage two. Compared with Textual Inversion and Dreambooth, the approach yields higher fidelity with substantially less training time and mitigates language drift and forgetting. Quantitative results using CLIP and DINO scores, along with qualitative assessments, show consistent improvements over baseline personalization methods, suggesting a practical path to efficient, robust subject-driven generation.
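The two-stage schedule described above can be summarized as a small sketch of which parameter groups are trained and which are frozen in each stage. This is an illustration, not the authors' implementation: the group names (`token_embedding`, `text_encoder`, `diffusion_model`) and the `Stage` class are hypothetical, and freezing everything except the embedding in stage one is an assumption borrowed from how Textual Inversion is usually set up.

```python
from dataclasses import dataclass

# Hypothetical parameter-group names; the source does not name them.
ALL_GROUPS = frozenset({"token_embedding", "text_encoder", "diffusion_model"})


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset       # groups that receive gradient updates
    prior_preservation: bool   # whether a Dreambooth-style prior loss is used

    @property
    def frozen(self) -> frozenset:
        # Every group that is not trained in this stage stays fixed.
        return ALL_GROUPS - self.trainable


def two_stage_schedule() -> list:
    # Stage 1: learn only the rare-adjective token embedding.
    # Assumption: all other weights stay frozen, as in Textual Inversion.
    stage1 = Stage("learn_token_embedding", frozenset({"token_embedding"}), False)
    # Stage 2: fine-tune the diffusion model without prior preservation,
    # keeping the text encoder and the learned embedding frozen.
    stage2 = Stage("finetune_diffusion_model", frozenset({"diffusion_model"}), False)
    return [stage1, stage2]


if __name__ == "__main__":
    for stage in two_stage_schedule():
        print(stage.name, "trains", sorted(stage.trainable),
              "freezes", sorted(stage.frozen))
```

In a real training loop, each group in `frozen` would typically have gradients disabled before the optimizer for that stage is built.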
Abstract
Diffusion models have demonstrated impressive image generation capabilities. Personalization methods such as Textual Inversion and Dreambooth adapt a pretrained model to a specific object using a few images of it, so that the object can then be generated in diverse textual contexts. Our proposed approach aims to retain the model's original knowledge while integrating the new information, yielding higher-fidelity results with less training time than Dreambooth and Textual Inversion.
