FaceStudio: Put Your Face Everywhere in Seconds

Yuxuan Yan, Chi Zhang, Rui Wang, Yichao Zhou, Gege Zhang, Pei Cheng, Gang Yu, Bin Fu

TL;DR

The paper tackles identity-preserving image synthesis for humans by introducing a tuning-free, feed-forward framework built on Stable Diffusion. It employs a hybrid guidance strategy that fuses style images, identity cues, and text prompts, plus a multi-identity cross-attention mechanism to map identities to corresponding subjects. Training relies on a reconstruction objective using masked-face style inputs and identity cues, avoiding text annotations. Results show improved identity fidelity and efficiency over DreamBooth and Textual Inversion, with capabilities for novel view synthesis, identity mixing, and multi-human generation, while acknowledging ethical and societal considerations.

Abstract

This study investigates identity-preserving image synthesis, an intriguing task in image generation that seeks to maintain a subject's identity while adding a personalized, stylistic touch. Traditional methods, such as Textual Inversion and DreamBooth, have made strides in custom image creation, but they come with significant drawbacks. These include the need for extensive resources and time for fine-tuning, as well as the requirement for multiple reference images. To overcome these challenges, our research introduces a novel approach to identity-preserving synthesis, with a particular focus on human images. Our model leverages a direct feed-forward mechanism, circumventing the need for intensive fine-tuning, thereby facilitating quick and efficient image generation. Central to our innovation is a hybrid guidance framework, which combines stylized images, facial images, and textual prompts to guide the image generation process. This unique combination enables our model to support a variety of applications, such as artistic portraits and identity-blended images. Our experimental results, including both qualitative and quantitative evaluations, demonstrate the superiority of our method over existing baseline models and previous works, particularly in its remarkable efficiency and ability to preserve the subject's identity with high fidelity.

Paper Structure

This paper contains 10 sections, 3 equations, 11 figures, and 1 table.

Figures (11)

  • Figure 1: Applications of our proposed framework for identity-preserving image synthesis. Our method can preserve the subject's identity in the synthesized images with high fidelity.
  • Figure 2: Hybrid-Guidance Identity-Preserving Image Synthesis Framework. Our model, built upon Stable Diffusion, utilizes text prompts and reference human images to guide image synthesis while preserving human identity through an identity input.
  • Figure 3: Comparison between standard cross-attentions in single-identity modeling (a) and the advanced cross-attentions tailored for multi-identity integration (b).
  • Figure 4: Influence of identity input on image construction. The addition of identity input proves to be effective in preserving the subject's identity within the generated image.
  • Figure 5: Identity-preserving novel view synthesis experiment. Our method excels at generating new views of a subject while maintaining its identity.
  • ...and 6 more figures
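
The contrast in Figure 3 between standard and multi-identity cross-attention can be illustrated with a small sketch. This is not the paper's implementation: it assumes that each image patch is assigned to exactly one identity region and that a patch may attend only to the tokens of its own identity. The function name `region_cross_attention`, the argument names, and the omission of learned query/key/value projections are all simplifying assumptions made for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax; masked entries (-inf) become exactly 0.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def region_cross_attention(queries, id_tokens, region_ids):
    """Hypothetical region-restricted cross-attention.

    queries:    (N, d) image-patch features
    id_tokens:  (K, T, d) T embedding tokens for each of K identities
    region_ids: (N,) integer in [0, K) naming the identity region of each patch
    Returns the attended features (N, d) and the attention weights (N, K*T).
    """
    n, d = queries.shape
    k, t, _ = id_tokens.shape
    keys = id_tokens.reshape(k * t, d)           # flatten all identity tokens
    scores = queries @ keys.T / np.sqrt(d)       # scaled dot-product scores
    token_owner = np.repeat(np.arange(k), t)     # identity index of each token
    # Assumption: a patch may only attend to tokens of its own identity.
    mask = region_ids[:, None] == token_owner[None, :]
    scores = np.where(mask, scores, -np.inf)
    weights = softmax(scores, axis=-1)
    return weights @ keys, weights

# Toy example: 6 patches, 2 identities with 3 tokens each.
rng = np.random.default_rng(0)
queries = rng.normal(size=(6, 4))
id_tokens = rng.normal(size=(2, 3, 4))
region_ids = np.array([0, 0, 0, 1, 1, 1])
out, weights = region_cross_attention(queries, id_tokens, region_ids)
```

In standard single-identity cross-attention the mask would be all `True`, so every patch attends to every token. The mask is what keeps one subject's identity features from leaking into another subject's region in this sketch.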