FlashFace: Human Image Personalization with High-fidelity Identity Preservation

Shilong Zhang, Lianghua Huang, Xi Chen, Yifei Zhang, Zhi-Fan Wu, Yutong Feng, Wei Wang, Yujun Shen, Yu Liu, Ping Luo

TL;DR

FlashFace tackles zero-shot human image personalization with high-fidelity identity preservation and accurate language following. It achieves this by encoding reference faces as spatial feature maps via a dedicated Face ReferenceNet and by injecting reference and text controls through disentangled attention within a diffusion-based framework. A large, multi-identity dataset and a novel training pipeline support robust identity guidance, while flexible inference controls allow balancing prompts and references. Experimental results show superior target-face fidelity and plausible prompt-driven variations, with applications ranging from age/gender editing to artwork-real transformations and face inpainting. The work advances practical subject-driven synthesis while addressing potential misuse and societal impacts.

Abstract

This work presents FlashFace, a practical tool with which users can easily personalize their own photos on the fly by providing one or a few reference face images and a text prompt. Our approach is distinguished from existing human photo customization methods by higher-fidelity identity preservation and better instruction following, which result from two subtle design choices. First, we encode the face identity into a series of feature maps instead of one image token as in prior art, allowing the model to retain more details of the reference faces (e.g., scars, tattoos, and face shape). Second, we introduce a disentangled integration strategy to balance the text and image guidance during the text-to-image generation process, alleviating the conflict between the reference faces and the text prompts (e.g., personalizing an adult into a "child" or an "elder"). Extensive experimental results demonstrate the effectiveness of our method on various applications, including human image personalization, face swapping under language prompts, making virtual characters into real people, etc. Project Page: https://jshilong.github.io/flashface-page.

Paper Structure (20 sections, 5 equations, 19 figures, 5 tables)

Figures (19)

  • Figure 1: Diverse human image personalization results produced by our proposed FlashFace, which (1) preserves the identity of reference faces in great detail (e.g., tattoos, scars, or even the rare face shape of virtual characters) and (2) accurately follows instructions, especially when the text prompts contradict the reference images (e.g., customizing an adult into a "child" or an "elder"). "Previous SoTA" refers to PhotoMaker [li2023photomaker].
  • Figure 1: Reference face images for each figure in the main text. Most generated images use four reference images, while the virtual-person-to-real-person transformation uses only one reference face image, which is cropped from the ID images; we therefore omit those references from this figure.
  • Figure 2: Concept comparison between FlashFace and previous embedding-based methods. We encode the face into a series of feature maps instead of several tokens to preserve finer details. We perform disentangled integration, using separate layers for the reference and text control, which helps achieve better instruction following. We also propose a novel data construction pipeline that ensures facial variation between the reference face and the generated face.
  • Figure 2: Number of clusters for different cluster sizes.
  • Figure 3: The overall pipeline of FlashFace. During training, we randomly select $B$ ID clusters and choose $N+1$ images from each cluster. We crop the face region from $N$ images as references and leave one as the target image, which is used to calculate the loss. The input latent of Face ReferenceNet has shape $(B \cdot N) \times 4 \times h \times w$. We store the reference face features after the self-attention layer within the middle blocks and decoder blocks. A face position mask is concatenated to the target latent to indicate the position of the generated face. As the target latent passes through the corresponding positions in the U-Net, we incorporate the reference features using an additional reference attention layer. During inference, users can obtain the desired image by providing reference images of the person, a description of the desired image, and optionally a face position.
  • ...and 14 more figures
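
The batch construction and tensor shapes described in the Figure 3 caption can be sketched as below. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the helper names `sample_training_batch` and `reference_attention` are hypothetical, the latents are random stand-ins, the mask is assumed to be concatenated along the channel axis, and the attention uses no learned projections and operates on raw latents rather than on stored ReferenceNet features.

```python
import numpy as np

def sample_training_batch(clusters, B, N, rng):
    """Pick B identity clusters, then N+1 images per cluster:
    N serve as references and the remaining one is the target."""
    eligible = [cid for cid, imgs in clusters.items() if len(imgs) >= N + 1]
    chosen = rng.choice(len(eligible), size=B, replace=False)
    refs, targets = [], []
    for idx in chosen:
        imgs = clusters[eligible[idx]]
        picks = rng.choice(len(imgs), size=N + 1, replace=False)
        refs.append([imgs[i] for i in picks[:N]])
        targets.append(imgs[picks[N]])
    return refs, targets

def reference_attention(target_tokens, ref_tokens):
    """Toy single-head attention (assumption: no learned Q/K/V projections).
    target_tokens: (B, L, C); ref_tokens: (B, M, C). Returns (B, L, C)."""
    scale = target_tokens.shape[-1] ** -0.5
    scores = target_tokens @ ref_tokens.transpose(0, 2, 1) * scale  # (B, L, M)
    scores = scores - scores.max(axis=-1, keepdims=True)            # stable softmax
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return target_tokens + weights @ ref_tokens                     # residual add

rng = np.random.default_rng(0)
B, N, h, w = 2, 3, 8, 8

# Hypothetical dataset: 5 identities with 6 image names each.
clusters = {f"id_{k}": [f"id_{k}_img_{j}" for j in range(6)] for k in range(5)}
refs, targets = sample_training_batch(clusters, B, N, rng)

# Random stand-ins for encoded latents.
ref_latents = rng.standard_normal((B * N, 4, h, w))   # Face ReferenceNet input
target_latents = rng.standard_normal((B, 4, h, w))

# Face position mask, concatenated to the target latent (channel axis assumed).
face_mask = np.zeros((B, 1, h, w))
face_mask[:, :, 2:6, 2:6] = 1.0
unet_input = np.concatenate([target_latents, face_mask], axis=1)  # (B, 5, h, w)

# Flatten spatial maps into tokens; each target attends to its own N references.
ref_tokens = (ref_latents.reshape(B, N, 4, h * w)
              .transpose(0, 1, 3, 2)
              .reshape(B, N * h * w, 4))
target_tokens = target_latents.reshape(B, 4, h * w).transpose(0, 2, 1)
fused = reference_attention(target_tokens, ref_tokens)             # (B, h*w, 4)
```

Because every spatial location of every reference contributes its own token, the target can attend to localized details, which is the motivation the captions give for using feature maps rather than a few identity tokens.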