DreamIdentity: Improved Editability for Efficient Face-identity Preserved Image Generation

Zhuowei Chen, Shancheng Fang, Wei Liu, Qian He, Mengqi Huang, Yongdong Zhang, Zhendong Mao

TL;DR

This work targets the problem of preserving a specific face identity in text-guided image synthesis without expensive per-identity optimization. It introduces DreamIdentity, featuring a Multi-word Multi-scale ID encoder that projects rich, multi-scale face features into multiple pseudo-words, combined with Self-Augmented Editability Learning that trains the model for editing tasks using a self-generated celebrity-based dataset. The method achieves superior identity preservation and text-alignment while maintaining the editability of the underlying diffusion model, and it does so with very fast encoding times. Overall, DreamIdentity enables efficient, identity-preserving, and editable face image generation for unseen identities directly from a single input image, and supports practical applications like scene switching and identity re-contextualization.

Abstract

While large-scale pre-trained text-to-image models can synthesize diverse, high-quality human-centric images, preserving the face identity of a conditioning face image remains an intractable problem. Existing methods either require time-consuming optimization for each face identity or learn an efficient encoder at the cost of harming the model's editability. In this work, we present a method that requires no per-identity optimization while retaining the editability of text-to-image models. Specifically, we propose a novel face-identity encoder that learns an accurate representation of human faces by applying multi-scale face features followed by a multi-embedding projector to directly generate pseudo-words in the text embedding space. In addition, we propose self-augmented editability learning to enhance the editability of models: we construct paired generated-face and edited-face images using celebrity names, aiming to transfer the mature ability of off-the-shelf text-to-image models on celebrity faces to unseen faces. Extensive experiments show that our method can generate identity-preserved images under different scenes at a much faster speed.
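The encoder idea in the abstract (multi-scale face features followed by a multi-embedding projector) can be illustrated with a minimal sketch. This is not the authors' implementation: the number of scales, the feature and text-embedding dimensions, the number of pseudo-words, the use of plain linear projections, and every function name below are hypothetical choices made for illustration, and random values stand in for both backbone features and trained weights.

```python
import random


def make_linear(in_dim, out_dim, rng):
    """Create small random weights (stand-ins for trained parameters)."""
    weights = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    bias = [0.0] * out_dim
    return weights, bias


def linear(weights, bias, x):
    """Apply y = W x + b, with W given as a list of rows."""
    return [sum(w * xi for w, xi in zip(row, x)) + b for row, b in zip(weights, bias)]


def project_to_pseudo_words(multi_scale_features, projectors):
    """Concatenate per-scale features, then map them to one embedding per pseudo-word."""
    fused = [value for feature in multi_scale_features for value in feature]
    return [linear(weights, bias, fused) for weights, bias in projectors]


rng = random.Random(0)
NUM_SCALES, FEAT_DIM, TEXT_DIM, NUM_WORDS = 4, 8, 16, 2  # hypothetical sizes

# Stand-ins for features taken from several depths of a face backbone.
features = [[rng.gauss(0, 1) for _ in range(FEAT_DIM)] for _ in range(NUM_SCALES)]

# One independent projector per pseudo-word (an assumption of this sketch).
projectors = [make_linear(NUM_SCALES * FEAT_DIM, TEXT_DIM, rng) for _ in range(NUM_WORDS)]

pseudo_words = project_to_pseudo_words(features, projectors)
print(len(pseudo_words), len(pseudo_words[0]))
```

Each resulting vector has the dimensionality of a text-token embedding, which is what allows it to be treated as a "word" by the text encoder; using several such vectors gives the identity more capacity than a single token would.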

Paper Structure

This paper contains 19 sections, 5 equations, 12 figures, and 2 tables.

Figures (12)

  • Figure 1: Given only one facial image, DreamIdentity can efficiently generate countless identity-preserved and text-coherent images in different contexts without any test-time optimization.
  • Figure 2: Overview of the proposed DreamIdentity: (a) The training and inference pipeline. The input face image is first encoded into multi-word embeddings (denoted by $S^*$) by our proposed $M^2$ ID encoder. Then $S^*$ is combined with the text input to generate a face-identity-preserved image in the text-aligned scene. (b) The architecture of the $M^2$ ID encoder, where a ViT-based face identity encoder is adopted as the backbone and the extracted multi-scale features are projected to multi-word embeddings. (c) The composition of the training data and its objectives. The training data consists of a public face dataset for reconstruction and a self-augmented dataset for editability learning.
  • Figure 3: Qualitative comparisons with state-of-the-art methods. DreamIdentity generates better text-aligned and ID-preserved images.
  • Figure 4: Qualitative comparisons between ID Encoder and the multi-scale features. The editing prompt is "S* as a chef, looking at the camera". We can conclude that both ID Encoder and the multi-scale features greatly improve ID preservation (i.e., face similarity).
  • Figure 5: Ablation study on self-augmented editability learning. Recon denotes reconstruction training; self-aug denotes self-augmented editability learning. Editability improves after applying self-aug.
  • ...and 7 more figures
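
The Figure 2(a) caption describes the multi-word embeddings $S^*$ being combined with the text input. One common way to do this with pseudo-word methods is to replace the placeholder token's embedding with the encoder outputs before the text encoder runs; the sketch below assumes that mechanism, and the function name, the toy two-dimensional embeddings, and the whitespace tokenization are all hypothetical rather than taken from the paper.

```python
def splice_pseudo_words(tokens, token_embeddings, pseudo_words, placeholder="S*"):
    """Replace each placeholder token's embedding with the pseudo-word embeddings."""
    spliced = []
    for token, embedding in zip(tokens, token_embeddings):
        if token == placeholder:
            # One placeholder expands into several pseudo-word embeddings.
            spliced.extend(pseudo_words)
        else:
            spliced.append(embedding)
    return spliced


# Toy prompt and toy 2-D embeddings (real text embeddings are much larger).
tokens = ["S*", "as", "a", "chef"]
embeddings = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
pseudo = [[0.5, 0.5], [0.25, 0.75]]  # stand-ins for the ID encoder outputs

spliced = splice_pseudo_words(tokens, embeddings, pseudo)
print(len(spliced))
```

Because the placeholder expands into multiple embeddings, the spliced sequence is longer than the tokenized prompt, so a real pipeline would need to account for the text encoder's maximum sequence length.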