StableIdentity: Inserting Anybody into Anywhere at First Sight

Qinghe Wang, Xu Jia, Xiaomin Li, Taiqing Li, Liqian Ma, Yunzhi Zhuge, Huchuan Lu

TL;DR

StableIdentity tackles one-shot personalized generation by encoding a target identity into a compact word-embedding space anchored by a face-recognition encoder and an editability prior built from celeb-name embeddings. It lands the identity in a celeb-aligned space using AdaIN and employs a masked, two-phase diffusion loss to balance layout accuracy and pixel-level fidelity, improving both identity preservation and editability. The method integrates smoothly with plug-in pipelines like ControlNet and enables zero-shot injection into video and 3D generation without model finetuning, demonstrating broad applicability. Empirical results show superior performance over prior customization approaches across quantitative and qualitative metrics, highlighting its potential to unify image, video, and 3D customized generation.
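
The AdaIN landing step can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes the celeb statistics are per-dimension means and standard deviations over the celeb-name embeddings, and that the predicted identity embedding is normalized by its own scalar mean and standard deviation. The function names `celeb_statistics` and `adain_land` and the toy vectors are hypothetical.

```python
import statistics


def celeb_statistics(celeb_embeddings):
    """Per-dimension mean and std over a list of celeb-name embeddings (assumed layout)."""
    dims = list(zip(*celeb_embeddings))  # transpose: one tuple per embedding dimension
    mean = [statistics.fmean(d) for d in dims]
    std = [statistics.pstdev(d) for d in dims]
    return mean, std


def adain_land(v, celeb_mean, celeb_std, eps=1e-8):
    """Normalize v by its own scalar mean/std, then rescale to the celeb statistics."""
    mu = statistics.fmean(v)
    sigma = statistics.pstdev(v)
    return [s * (x - mu) / (sigma + eps) + m
            for x, m, s in zip(v, celeb_mean, celeb_std)]


# Toy example: two 3-dimensional "celeb" embeddings and one raw identity embedding.
celebs = [[0.0, 1.0, 2.0], [2.0, 3.0, 6.0]]
mean, std = celeb_statistics(celebs)
raw_identity = [10.0, 20.0, 30.0]
landed = adain_land(raw_identity, mean, std)
print(landed)
```

After landing, the identity embedding carries the first- and second-order statistics of the celeb space, which is the intuition behind the editability prior: the text encoder sees a token that is distributed like the names it already knows how to place in varied contexts.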

Abstract

Recent advances in large pretrained text-to-image models have shown unprecedented capabilities for high-quality human-centric generation; however, customizing face identity remains an intractable problem. Existing methods cannot ensure both stable identity preservation and flexible editability, even with several images of each subject during training. In this work, we propose StableIdentity, which allows identity-consistent recontextualization with just one face image. More specifically, we employ a face encoder with an identity prior to encode the input face, and then land the face representation into a space with an editability prior, which is constructed from celeb names. By incorporating the identity prior and the editability prior, the learned identity can be injected anywhere with various contexts. In addition, we design a masked two-phase diffusion loss to boost the pixel-level perception of the input face and maintain the diversity of generation. Extensive experiments demonstrate that our method outperforms previous customization methods. Moreover, the learned identity can be flexibly combined with off-the-shelf modules such as ControlNet. Notably, to the best of our knowledge, we are the first to directly inject the identity learned from a single image into video/3D generation without finetuning. We believe that the proposed StableIdentity is an important step toward unifying image, video, and 3D customized generation models.
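
One plausible reading of the masked two-phase diffusion loss is sketched below. It is an illustration under stated assumptions rather than the paper's exact formulation: the noise-prediction term is assumed to apply at large timesteps and the reconstruction term on the predicted clean latent at small timesteps, the mask is assumed to be a binary face-region mask, and the switch point `t_switch` is a hypothetical parameter. The clean-latent estimate uses the standard DDPM relation $\hat{z}_0 = (z_t - \sqrt{1-\bar{\alpha}_t}\,\hat{\epsilon}) / \sqrt{\bar{\alpha}_t}$. Latents are flat Python lists for brevity.

```python
import math


def predict_z0(z_t, eps_pred, alpha_bar_t):
    """Standard DDPM estimate of the clean latent from z_t and the predicted noise."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [(z - b * e) / a for z, e in zip(z_t, eps_pred)]


def masked_mse(pred, target, mask):
    """Squared error with masked-out elements zeroed, averaged over all elements."""
    return sum(m * (p - q) ** 2 for p, q, m in zip(pred, target, mask)) / len(pred)


def two_phase_loss(z0, z_t, eps, eps_pred, mask, t, alpha_bar_t, t_switch):
    """Noise loss at large t, reconstruction loss at small t (t_switch is hypothetical)."""
    if t >= t_switch:
        return masked_mse(eps_pred, eps, mask)  # noise-prediction phase
    z0_hat = predict_z0(z_t, eps_pred, alpha_bar_t)
    return masked_mse(z0_hat, z0, mask)  # reconstruction phase


# Toy example with a 2-element latent and alpha_bar_t = 0.64.
alpha_bar_t = 0.64
z0 = [1.0, -1.0]
eps = [0.5, 0.5]
z_t = [math.sqrt(alpha_bar_t) * z + math.sqrt(1.0 - alpha_bar_t) * e
       for z, e in zip(z0, eps)]
eps_pred = [0.5, 1.5]  # wrong only on the second element
full_mask, face_mask = [1, 1], [1, 0]

noise_loss = two_phase_loss(z0, z_t, eps, eps_pred, full_mask, 800, alpha_bar_t, 600)
rec_loss = two_phase_loss(z0, z_t, eps, eps_pred, full_mask, 200, alpha_bar_t, 600)
masked_loss = two_phase_loss(z0, z_t, eps, eps_pred, face_mask, 200, alpha_bar_t, 600)
print(noise_loss, rec_loss, masked_loss)
```

In the toy example the prediction error lies entirely outside the mask, so the masked loss vanishes while the unmasked losses do not. This is the sense in which masking focuses training on the face region and leaves the rest of the image unconstrained.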

Paper Structure

This paper contains 19 sections, 6 equations, 15 figures, 3 tables.

Figures (15)

  • Figure 1: Given a single input image, the proposed StableIdentity can generate diverse customized images in various contexts. Notably, the learned identity can be combined with ControlNet [zhang2023adding] and even injected into video (ModelScopeT2V [wang2023modelscope]) and 3D (LucidDreamer [liang2023luciddreamer]) generation.
  • Figure 2: Overview of the proposed StableIdentity. Given a single face image, we first employ an FR-ViT encoder and MLPs to capture the identity representation, and then land it into our constructed celeb embedding space to better learn identity-consistent editability. In addition, we design a masked two-phase diffusion loss, comprising $\mathcal{L}_{noise}$ and $\mathcal{L}_{rec}$, for training.
  • Figure 3: The predicted $\hat{z}_0$ from $z_t$ at various timesteps $t$. The results at $t=\{100,200\}$ are similar to those at $t=300$ and are omitted for brevity.
  • Figure 4: Qualitative comparisons with six baselines for different identities (including various races) and diverse text prompts (covering decoration, action, attribute, background, and style). Our method achieves high-quality generation with consistent identity and outstanding editability (zoom in for the best view). We provide more results in the supplementary material.
  • Figure 5: Ablation study for model architecture. We show the results of using the CLIP image encoder and removing the AdaIN.
  • ...and 10 more figures