Fusion is all you need: Face Fusion for Customized Identity-Preserving Image Synthesis

Salaheldin Mohamed, Dong Han, Yong Li

TL;DR

This work uses the pre-trained UNet from Stable Diffusion to incorporate the target face image directly into the generation process, and innovatively alters the cross-attention layers of the UNet to effectively fuse individual identities into the generative process.

Abstract

Text-to-image (T2I) models have significantly advanced the development of artificial intelligence, enabling the generation of high-quality images in diverse contexts based on specific text prompts. However, existing T2I-based methods often struggle to accurately reproduce the appearance of individuals from a reference image and to create novel representations of those individuals in various settings. To address this, we leverage the pre-trained UNet from Stable Diffusion to incorporate the target face image directly into the generation process. Our approach diverges from prior methods that depend on fixed encoders or static face embeddings, which often fail to bridge encoding gaps. Instead, we capitalize on UNet's sophisticated encoding capabilities to process reference images across multiple scales. By innovatively altering the cross-attention layers of the UNet, we effectively fuse individual identities into the generative process. This strategic integration of facial features across various scales not only enhances the robustness and consistency of the generated images but also facilitates efficient multi-reference and multi-identity generation. Our method sets a new benchmark in identity-preserving image generation, delivering state-of-the-art results in similarity metrics while maintaining prompt alignment.

Paper Structure

This paper contains 17 sections, 7 equations, 6 figures, and 2 tables.

Figures (6)

  • Figure 1: Comparison with other SOTA methods in terms of facial expression controllability and prompt alignment. The first row shows the input identities. Note that IPA-FaceId-Plus [ipadapter] loses face fidelity, while InstantID [wang2024instantid] loses some prompt alignment. Our method, in contrast, maintains ID preservation with accurate facial expression generation.
  • Figure 2: The overall pipeline of the proposed method. We first concatenate the target face image with the image to be generated so that attention is applied at different scales. We then modify the cross-attention layers to include new key and value layers, whose output is concatenated with the output of the original key and value layers, which take the text prompt.
  • Figure 3: Comparison of using n references to guide generation with our method. The first column shows the reference identities, the second column uses only one of the references as guidance, the third uses two references, and the fourth uses all three. The generated face is a mix of the features in the references.
  • Figure 4: Multi-identity image generation using our proposed method. The first row shows the three distinct input identities; below them are sample results for different prompts. A depth-map-based ControlNet [zhang2023adding] was used for the pose.
  • Figure 5: Comparison with other SOTA methods using the prompt "astronaut in a garden". The first row shows the input target face identities; the following rows are results from IP-faceAdapter [ipadapter], IP-face-Adapter-plus [ipadapter], and ours, respectively. Identities are selected from the FFHQ dataset [karras2019style].
  • ...and 1 more figure
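
The Figure 2 caption describes adding new key and value layers to the cross-attention and concatenating their output with that of the original text-prompt layers. The following is a minimal NumPy sketch of one possible reading, not the authors' implementation. It assumes the concatenation happens along the token axis of the keys and values (the caption does not say whether keys/values or attention outputs are concatenated), uses a single attention head, and uses random weights in place of trained ones. All function names, parameter names, and shapes are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def fused_cross_attention(latent, text_tokens, ref_tokens,
                          w_q, w_k_text, w_v_text, w_k_ref, w_v_ref):
    """Single-head cross-attention over text tokens plus reference-face tokens.

    Assumption: the new key/value projections (w_k_ref, w_v_ref) are applied
    to reference-face features, and their outputs are concatenated with the
    text keys/values along the token axis.
    """
    q = latent @ w_q                                     # (n_latent, d_head)
    k = np.concatenate([text_tokens @ w_k_text,          # original text keys
                        ref_tokens @ w_k_ref], axis=0)   # new reference keys
    v = np.concatenate([text_tokens @ w_v_text,          # original text values
                        ref_tokens @ w_v_ref], axis=0)   # new reference values
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]), axis=-1)
    return attn @ v, attn

# Toy shapes chosen only for illustration.
rng = np.random.default_rng(0)
d_model, d_text, d_head = 8, 6, 4
latent = rng.normal(size=(16, d_model))       # features of the image being generated
text_tokens = rng.normal(size=(5, d_text))    # text-prompt embeddings
ref_tokens = rng.normal(size=(16, d_model))   # features of the reference face

w_q = rng.normal(size=(d_model, d_head))
w_k_text = rng.normal(size=(d_text, d_head))
w_v_text = rng.normal(size=(d_text, d_head))
w_k_ref = rng.normal(size=(d_model, d_head))
w_v_ref = rng.normal(size=(d_model, d_head))

out, attn = fused_cross_attention(latent, text_tokens, ref_tokens,
                                  w_q, w_k_text, w_v_text, w_k_ref, w_v_ref)
print(out.shape, attn.shape)
```

In this sketch each latent position attends jointly over 5 text tokens and 16 reference tokens, so the attention matrix has 21 columns. Under the same assumption, several reference images could be handled by stacking their tokens in `ref_tokens`, though the paper's actual multi-reference mechanism may differ.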