MagicNaming: Consistent Identity Generation by Finding a "Name Space" in T2I Diffusion Models

Jing Zhao, Heliang Zheng, Chaoyue Wang, Long Lan, Wanrong Huang, Yuhua Tang

TL;DR

This work addresses extending consistent identity generation from celebrities to generic identities in text-to-image diffusion by discovering a controllable $\mathcal{N}$ Space. It decouples identity cues from semantic prompts, learns an image-to-$\mathcal{N}$ Space encoder using a large LaionCele dataset, and injects identity information into prompts via name prepending without fine-tuning the base model. The approach yields strong identity consistency while preserving the original semantic and stylistic capabilities of diffusion models, and supports Fictional Identities through interpolation in the $\mathcal{N}$ Space. Practically, this enables scalable, noninvasive identity control across scene construction, stylization, action, and emotion tasks, with broad applicability to U-net variants and media production workflows.

Abstract

Large-scale text-to-image diffusion models (e.g., DALL-E, SDXL) can generate famous people simply by referring to their names. Is it possible to make such models generate generic identities as simply as famous ones, e.g., by just using a name? In this paper, we explore the existence of a "Name Space", where any point in the space corresponds to a specific identity. Fortunately, we find some clues in the feature space spanned by the text embeddings of celebrities' names. Specifically, we first extract the embeddings of celebrities' names in the Laion5B dataset with the text encoder of diffusion models. Such embeddings are used as supervision to learn an encoder that can predict the name (actually an embedding) of a given face image. We experimentally find that such name embeddings work well in giving the generated images good identity consistency. Note that, like the names of celebrities, our predicted name embeddings are disentangled from the semantics of text inputs, so the original generation capability of text-to-image models is well preserved. Moreover, by simply plugging in such name embeddings, all variants (e.g., from Civitai) derived from the same base model (i.e., SDXL) readily become identity-aware text-to-image models. Project homepage: https://magicfusion.github.io/MagicNaming/
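
The supervision described in the abstract can be written as a regression objective. The following is a sketch under stated assumptions, using the symbols from the Figure 3 caption ($E_{text}$, $E_{image}$, $F_{ID}^{gt}$); the predicted embedding $\hat{F}_{ID}$, the input face image $x$, the embedding size $d$, and the per-element averaging are notation introduced here for illustration and may differ from the paper's exact formulation.

$$
F_{ID}^{gt} = E_{text}(\text{name}), \qquad \hat{F}_{ID} = E_{image}(x), \qquad \mathcal{L}_{MSE} = \frac{1}{d}\,\bigl\lVert \hat{F}_{ID} - F_{ID}^{gt} \bigr\rVert_2^2
$$

Under this reading, only the image encoder is trained; the text encoder serves to produce regression targets from celebrity names, and the base diffusion model is left untouched.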

Paper Structure

This paper contains 26 sections, 8 equations, 16 figures, 3 tables.

Figures (16)

  • Figure 1: Consistent Identity Generation by fetching "names" from the "Name Space".
  • Figure 2: Name embeddings and textual semantics are disentangled.
  • Figure 3: Overview of the Proposed Method. (a) Dataset Construction. Celebrity images and their corresponding names were extracted from the Laion5B dataset, with name embeddings generated from the names by a text encoder $E_{text}$. (b) Image Encoder Architecture and Training. Features were extracted from input images using two CLIP image encoders, followed by a three-layer fully connected network to produce the "name" prediction. Mean Squared Error (MSE) loss was computed against the ground truth name embedding $F_{ID}^{gt}$. (c) Pipeline Inference. The name embedding predicted by the image encoder $E_{image}$ was combined with the original text embeddings by "name" Prepending ($NP(\cdot)$) to obtain the final embeddings $F_{prompt}^{ID}$. These embeddings were then used to guide a U-net model for denoising. (d) "Name" Integrating. There is no need to set a specific placeholder or identify its position; simply inserting the name embedding between the start token (red block) and the first semantic token (blue block) is sufficient to achieve consistent identity generation. Padding tokens (yellow block) beyond the length of 77 are discarded.
  • Figure 4: Visualization and Comparison. The results cover four tasks, i.e., scene construction, stylization, action control, and emotional editing. Please pay as much attention to the semantic consistency and visual aesthetics of the images as to ID consistency. The results demonstrate that our approach maintains ID consistency while preserving the original semantic performance (complex semantic consistency) of the generator (SDXL), which the other compared works do not achieve.
  • Figure 5: A fictional character is created by interpolating, and the fictional character supports consistent identity generation.
  • ...and 11 more figures
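
The "name" Prepending step in Figure 3(d) and the interpolation in Figure 5 can be illustrated with a minimal Python sketch. It is not the authors' implementation: the function names `name_prepend` and `interpolate_names`, the toy embedding size, the two-token name embedding, and the use of plain linear interpolation are all assumptions made for illustration. Only the insertion position (between the start token and the first semantic token) and the truncation to 77 tokens come from the caption.

```python
from typing import List, Sequence

Vector = List[float]
MAX_TOKENS = 77  # sequence length stated in the Figure 3(d) caption


def name_prepend(text_emb: Sequence[Vector],
                 name_emb: Sequence[Vector],
                 max_tokens: int = MAX_TOKENS) -> List[Vector]:
    """Insert name-embedding tokens right after the start token, then truncate.

    text_emb: per-token prompt embeddings, start token first (hypothetical layout).
    name_emb: per-token identity embeddings predicted from a face image.
    """
    if not text_emb:
        raise ValueError("text_emb must contain at least the start token")
    start, rest = list(text_emb[:1]), list(text_emb[1:])
    combined = start + list(name_emb) + rest
    # Tokens pushed past max_tokens (trailing padding in a padded prompt) are dropped.
    return combined[:max_tokens]


def interpolate_names(name_a: Sequence[Vector],
                      name_b: Sequence[Vector],
                      t: float) -> List[Vector]:
    """Blend two name embeddings; linear interpolation is an assumed scheme."""
    if len(name_a) != len(name_b):
        raise ValueError("name embeddings must have the same number of tokens")
    return [[(1.0 - t) * x + t * y for x, y in zip(row_a, row_b)]
            for row_a, row_b in zip(name_a, name_b)]


# Toy data: 4-dimensional embeddings, a prompt padded to 77 tokens.
dim = 4
start = [1.0] * dim
semantic = [[2.0] * dim for _ in range(3)]
padding = [[0.0] * dim for _ in range(MAX_TOKENS - 1 - len(semantic))]
text_emb = [start] + semantic + padding

name_a = [[5.0] * dim, [6.0] * dim]  # assumed: the name occupies two tokens
name_b = [[7.0] * dim, [8.0] * dim]

prompt_emb = name_prepend(text_emb, name_a)
fictional = interpolate_names(name_a, name_b, 0.5)
fictional_prompt = name_prepend(text_emb, fictional)
```

Because the name tokens are inserted before the semantic tokens and only trailing padding is cut, the prompt text itself stays unchanged as long as the prompt leaves enough padding to absorb the name tokens, which matches the caption's remark that no placeholder or position search is needed.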