PortraitBooth: A Versatile Portrait Model for Fast Identity-preserved Personalization
Xu Peng, Junwei Zhu, Boyuan Jiang, Ying Tai, Donghao Luo, Jiangning Zhang, Wei Lin, Taisong Jin, Chengjie Wang, Rongrong Ji
TL;DR
PortraitBooth addresses inefficiency and identity distortion in diffusion-based portrait personalization by introducing Subject Text Embedding Augmentation (STEA) to fuse identity with text prompts, Dynamic Identity Preservation (DIP) to maintain fidelity, and Emotion-aware Cross-attention Control (ECAC) for expressive editing. It is designed as a one-shot, tuning-free framework capable of high-fidelity, editable portrait generation and scalable multi-subject creation. Extensive experiments on CelebV-T show superior identity preservation and expression controllability compared with state-of-the-art baselines, while requiring substantially less training and inference overhead. The work provides a practical, scalable baseline for efficient, editable, identity-preserving portrait generation in diffusion models.
Abstract
Recent advancements in personalized image generation using diffusion models have been noteworthy. However, existing methods are inefficient because they require subject-specific fine-tuning. This computationally intensive process hinders efficient deployment and limits practical usability. Moreover, these methods often suffer from identity distortion and limited expression diversity. In light of these challenges, we propose PortraitBooth, an innovative approach designed for high efficiency, robust identity preservation, and expression-editable text-to-image generation, without the need for fine-tuning. PortraitBooth leverages subject embeddings from a face recognition model for personalized image generation, which avoids the computational overhead of subject-specific fine-tuning and mitigates identity distortion. The introduced dynamic identity preservation strategy further ensures close resemblance to the original image identity. Moreover, PortraitBooth incorporates emotion-aware cross-attention control for diverse facial expressions in generated images, supporting text-driven expression editing. Its scalability enables efficient and high-quality image creation, including multi-subject generation. Extensive results demonstrate superior performance over other state-of-the-art methods in both single and multiple image generation scenarios.
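
To make the idea of fusing a face-recognition embedding with a text prompt concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the identity embedding is concatenated with the embedding of a single subject token (for example "man") and passed through a small two-layer MLP whose output replaces that token. The function names, the dimensions, the random weights, and the MLP design are all hypothetical choices made for illustration.

```python
import random


def linear(x, weights, bias):
    """Compute x @ weights + bias, with weights stored as [in][out]."""
    return [
        sum(xi * row[j] for xi, row in zip(x, weights)) + bias[j]
        for j in range(len(bias))
    ]


def augment_subject_token(text_emb, id_emb, subject_idx, mlp):
    """Replace one prompt token with an identity-aware embedding.

    text_emb:    T token embeddings of size D (stand-in for a text encoder output).
    id_emb:      identity vector of size K (stand-in for a face recognition model output).
    subject_idx: position of the subject token in the prompt.
    mlp:         (w1, b1, w2, b2) weights of a hypothetical two-layer MLP.
    """
    w1, b1, w2, b2 = mlp
    fused_in = text_emb[subject_idx] + id_emb  # list concatenation, size D + K
    hidden = [max(h, 0.0) for h in linear(fused_in, w1, b1)]  # ReLU
    fused = linear(hidden, w2, b2)  # projected back to size D
    out = [row[:] for row in text_emb]  # copy so other tokens stay untouched
    out[subject_idx] = fused
    return out


def random_matrix(rng, rows, cols, scale=0.1):
    """Random Gaussian matrix used as placeholder weights or embeddings."""
    return [[rng.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


# Toy sizes (assumptions): 6 tokens, text dim 8, identity dim 16, hidden dim 32.
rng = random.Random(0)
T, D, K, H = 6, 8, 16, 32
text_emb = random_matrix(rng, T, D, scale=1.0)
id_emb = [rng.gauss(0.0, 1.0) for _ in range(K)]
mlp = (random_matrix(rng, D + K, H), [0.0] * H, random_matrix(rng, H, D), [0.0] * D)

augmented = augment_subject_token(text_emb, id_emb, subject_idx=2, mlp=mlp)
print(len(augmented), len(augmented[0]))
```

In this sketch only the subject token changes, so the augmented sequence keeps the shape a diffusion model's cross-attention layers would expect from the original prompt embedding. In a trained system the MLP weights would be learned rather than random.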
