PortraitBooth: A Versatile Portrait Model for Fast Identity-preserved Personalization

Xu Peng, Junwei Zhu, Boyuan Jiang, Ying Tai, Donghao Luo, Jiangning Zhang, Wei Lin, Taisong Jin, Chengjie Wang, Rongrong Ji

TL;DR

PortraitBooth addresses inefficiency and identity distortion in diffusion-based portrait personalization by introducing Subject Text Embedding Augmentation (STEA) to fuse identity with text prompts, Dynamic Identity Preservation (DIP) to maintain fidelity, and Emotion-aware Cross-attention Control (ECAC) for expressive editing. It is designed as a one-shot, tuning-free framework capable of high-fidelity, editable portrait generation and scalable multi-subject creation. Extensive experiments on CelebV-T show superior identity preservation and expression controllability compared with state-of-the-art baselines, while requiring substantially less training and inference overhead. The work provides a practical, scalable baseline for efficient, editable, identity-preserving portrait generation in diffusion models.

Abstract

Recent advancements in personalized image generation using diffusion models have been noteworthy. However, existing methods suffer from inefficiencies due to the requirement for subject-specific fine-tuning. This computationally intensive process hinders efficient deployment, limiting practical usability. Moreover, these methods often grapple with identity distortion and limited expression diversity. In light of these challenges, we propose PortraitBooth, an innovative approach designed for high efficiency, robust identity preservation, and expression-editable text-to-image generation, without the need for fine-tuning. PortraitBooth leverages subject embeddings from a face recognition model for personalized image generation without fine-tuning. It eliminates computational overhead and mitigates identity distortion. The introduced dynamic identity preservation strategy further ensures close resemblance to the original image identity. Moreover, PortraitBooth incorporates emotion-aware cross-attention control for diverse facial expressions in generated images, supporting text-driven expression editing. Its scalability enables efficient and high-quality image creation, including multi-subject generation. Extensive results demonstrate superior performance over other state-of-the-art methods in both single and multiple image generation scenarios.

Paper Structure

This paper contains 19 sections, 8 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Comparison of identity information obtained based on the trained image encoder and pre-trained face recognition model.
  • Figure 2: Overview of the PortraitBooth framework. PortraitBooth extracts the face $f$ from the input image $x_0$ and augments the subject's features using TFace for improved identity representation. The diffusion model is trained to generate images with enhanced conditioning, incorporating emotion-aware cross-attention for expression editing and dynamic identity preservation to maintain identity. During the testing phase, only a single image and the corresponding prompt are needed to achieve rapid, robust identity preservation and diverse expression editing. $A^i_l$ and $A^j_l$ represent the cross-attention maps corresponding to the $i$-th and $j$-th tokens at the $l$-th cross-attention layer, respectively. $\beta$ and $\gamma$ represent the maximum values of the cross-attention maps for the identity token and the expression token, respectively, while $R_t$ indicates the timing for identity preservation.
  • Figure 3: Comparison of different methods on single subject image generation in the testing dataset.
  • Figure 4: Comparison of different methods on multi-subject image generation in the testing dataset.
  • Figure 5: Comparison chart of expression editing between our method and FastComposer, focusing on the three most distinct expression terms.
  • ...and 3 more figures
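
The Figure 2 caption refers to per-token cross-attention maps $A^i_l$ and to $\beta$ and $\gamma$ as the maximum values of the identity-token and expression-token maps. As a rough illustration of those quantities only, the sketch below computes standard scaled dot-product cross-attention for a single layer and reads off each token's peak value. It is not the paper's implementation: the function names, the toy inputs, the single-head setup without learned projections, and the choice to take the maximum over spatial positions are all assumptions made here.

```python
import math


def cross_attention_maps(queries, keys):
    """Return maps[p][i]: attention weight of image position p on text token i.

    queries: list of P vectors (one per image position), each of length d.
    keys:    list of N vectors (one per text token), each of length d.
    Standard scaled dot-product attention; learned projections are omitted.
    """
    d = len(keys[0])
    maps = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in keys]
        m = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - m) for s in scores]
        total = sum(exps)
        maps.append([e / total for e in exps])
    return maps


def token_peak(maps, token_index):
    """Maximum of one token's attention map over all image positions (assumed reading)."""
    return max(row[token_index] for row in maps)


# Toy example: 3 image positions, 2 text tokens, feature dimension 2.
# Treating token 0 as the identity token and token 1 as the expression
# token is a hypothetical labeling for illustration.
queries = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
keys = [[2.0, 0.0], [0.0, 2.0]]

maps = cross_attention_maps(queries, keys)
beta = token_peak(maps, 0)   # peak of the identity-token map
gamma = token_peak(maps, 1)  # peak of the expression-token map
```

In this toy setup each row of `maps` is a softmax over the text tokens, so it sums to 1, and the two peaks coincide because the inputs are symmetric. How PortraitBooth actually uses $\beta$, $\gamma$, and $R_t$ is not specified in this excerpt.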