Inserting Anybody in Diffusion Models via Celeb Basis

Ge Yuan, Xiaodong Cun, Yong Zhang, Maomao Li, Chenyang Qi, Xintao Wang, Ying Shan, Huicheng Zheng

TL;DR

This work introduces a Celeb Basis that represents identities through PCA-based embeddings of celebrity names, enabling efficient one-shot personalization of pretrained diffusion models. By mapping a single face photo to two 512-D coefficient vectors and freezing the diffusion components, the method achieves robust identity preservation and enables interactions between multiple newly learned identities with only 1024 tunable parameters and around 3 minutes of training. The approach outperforms existing personalization methods on identity fidelity and interaction capabilities while offering scalable multi-identity learning via a shared MLP. Ethical considerations are discussed, along with limitations related to facial realism and the potential for misuse, suggesting careful deployment and future improvements in realism and scope.

Abstract

There is strong demand for customizing pretrained large text-to-image models, $\textit{e.g.}$, Stable Diffusion, to generate new concepts, such as the users themselves. However, a concept newly added by previous customization methods often shows weaker combination abilities than the original concepts, even when several images are given during training. We thus propose a new personalization method that allows for the seamless integration of a unique individual into the pre-trained diffusion model using just $\textbf{one facial photograph}$ and only $\textbf{1024 learnable parameters}$ in under $\textbf{3 minutes}$. As a result, we can effortlessly generate stunning images of this person in any pose or position, interacting with anyone and doing anything imaginable from text prompts. To achieve this, we first analyze and build a well-defined celeb basis from the embedding space of the pre-trained large text encoder. Then, given one facial photo as the target identity, we generate its embedding by optimizing the weights of this basis while locking all other parameters. Empowered by the proposed celeb basis, the new identity in our customized model shows better concept combination ability than previous personalization methods. In addition, our model can learn several new identities at once and have them interact with each other, which previous customization models fail to do. The code will be released.
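
The idea of optimizing only the basis weights while everything else stays locked can be pictured with a toy example. This is a minimal sketch, not the authors' code: the small hand-written `basis`, the `mean` and `target` vectors, and the squared-error objective are all illustrative assumptions (in the paper the training signal comes from the frozen diffusion model, not from a known target embedding). The last lines only restate the parameter count from the TL;DR: two 512-D coefficient vectors give 1024 trainable values.

```python
# Toy sketch: optimize only the coefficients of a frozen orthonormal basis.
# All values here are hypothetical stand-ins, not the paper's data or loss.

def compose(mean, basis, coeffs):
    """Return mean + sum_j coeffs[j] * basis[j] (a linear combination)."""
    out = list(mean)
    for c, b in zip(coeffs, basis):
        for i, b_i in enumerate(b):
            out[i] += c * b_i
    return out

def fit_coefficients(mean, basis, target, steps=200, lr=0.1):
    """Gradient descent on ||compose(...) - target||^2 w.r.t. coeffs only."""
    coeffs = [0.0] * len(basis)
    for _ in range(steps):
        residual = [e - t for e, t in zip(compose(mean, basis, coeffs), target)]
        # d/dc_j of the squared error is 2 * <residual, basis[j]>.
        grads = [2.0 * sum(r * b_i for r, b_i in zip(residual, b)) for b in basis]
        coeffs = [c - lr * g for c, g in zip(coeffs, grads)]
    return coeffs

# Frozen quantities (assumed toy values): a mean embedding and three basis vectors.
mean = [0.5, 0.5, 0.5, 0.5]
basis = [[1.0, 0.0, 0.0, 0.0],
         [0.0, 1.0, 0.0, 0.0],
         [0.0, 0.0, 1.0, 0.0]]
target = [1.5, -0.5, 0.75, 0.5]

# Only `coeffs` is updated; `mean` and `basis` never change.
coeffs = fit_coefficients(mean, basis, target)

# Parameter count from the TL;DR: two coefficient vectors of 512 entries each.
num_tokens, num_components = 2, 512
trainable = num_tokens * num_components
```

Because the search is restricted to the span of the basis, the number of trainable values depends only on the number of basis vectors, not on the size of the text encoder or the diffusion model.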

Paper Structure (22 sections, 2 equations, 17 figures, 2 tables)


Figures (17)

  • Figure 1: Given a single facial photo ($v1$ or $v2$) as a tunable sample, the proposed method can insert this identity into the trained text-to-image model, e.g., Stable Diffusion [stable-diffusion], where the new person ($v1$) can act like the original concept in the trained model and interact with another newly trained concept ($v2$). Note that the input images are randomly generated from StyleGAN [stylegan].
  • Figure 2: The building process of the proposed Celeb Basis. First, we collect about 1,500 celebrity names as the initial collection. Then, we manually filter this collection down to $m=691$ names, based on the synthesis quality of the text-to-image diffusion model [stable-diffusion] with the corresponding name prompt. Next, each filtered name is tokenized and encoded into a celeb embedding group $g_i$. Finally, we conduct Principal Component Analysis to build a compact orthogonal basis, which is visualized on the right.
  • Figure 3: The interpolated text embedding of two celebrities still depicts a human (top row), and it also shows strong concept combination abilities in the pretrained Stable Diffusion [stable-diffusion] (bottom row).
  • Figure 4: During training (left), we optimize the coefficients of the celeb basis with the help of a fixed face encoder. During inference (right), we combine the learned personalized weights and shared celeb basis to generate images with the input identity.
  • Figure 5: We compare several abilities of our method and the baselines (Textual Inversion [textual-inversion], DreamBooth [dreambooth], and Custom Diffusion [custom-diffusion]).
  • ...and 12 more figures
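
The basis construction in the Figure 2 caption and the inference step in the Figure 4 caption can be sketched roughly as follows. This is an illustration under stated assumptions, not the released implementation: random arrays stand in for the real encoded name embeddings, the embedding width `d = 768` and the choice of one PCA basis per token position are assumptions, and `build_basis` and `compose_identity` are hypothetical helper names. Only $m=691$, the two coefficient vectors, and their 512-D size come from the text above.

```python
import numpy as np

rng = np.random.default_rng(0)

# m names and two 512-D coefficient vectors come from the text;
# d (embedding width) is an assumed value for illustration.
m, k, d, p = 691, 2, 768, 512

# Hypothetical stand-in for the encoded celeb embedding groups g_i.
groups = rng.normal(size=(m, k, d))

def build_basis(embeddings, num_components):
    """PCA via SVD: return the mean and the top principal directions (rows)."""
    mean = embeddings.mean(axis=0)
    centered = embeddings - mean
    # Rows of vt are orthonormal principal directions, sorted by variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean, vt[:num_components]

# One basis per token position (an assumption about how groups are handled).
bases = [build_basis(groups[:, t, :], p) for t in range(k)]

def compose_identity(bases, coeffs):
    """Map k coefficient vectors of length p to k token embeddings of length d."""
    return np.stack([mean + c @ basis for (mean, basis), c in zip(bases, coeffs)])

# Stand-in for learned personalized weights: k * p = 1024 values in total.
coeffs = rng.normal(size=(k, p))
identity_tokens = compose_identity(bases, coeffs)
```

In this sketch the resulting `identity_tokens` would take the place of a name's token embeddings in the text prompt, while the shared basis stays the same for every identity.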