InstaFace: Identity-Preserving Facial Editing with Single Image Inference
MD Wahiduzzaman Khan, Mingshan Jia, Xiaolin Zhang, En Yu, Caifeng Shan, Kaska Musial-Gabrys
TL;DR
InstaFace tackles single-image, identity-preserving facial editing under large pose, expression, and lighting changes by combining a diffusion-based generator with a 3DMM-conditioned latent guidance system and a dual-embedding Identity Preserver. The 3D Fusion Controller integrates multiple 3DMM-based conditionals into the latent space without adding trainable parameters, while the Identity Preserver combines CLIP and face-recognition (FR) embeddings to retain identity and contextual details such as background, hair, and accessories. The model is trained in two stages, using DECA-derived conditionals and a CLIP+FR projection module, and outperforms several state-of-the-art baselines in identity retention and photorealism. This work enables realistic, identity-consistent facial edits from a single image, with potential impact on digital avatars, AR/VR, and personalized content creation.
Abstract
Facial appearance editing is crucial for digital avatars, AR/VR, and personalized content creation, driving realistic user experiences. However, preserving identity with generative models is challenging, especially when data is limited. Traditional methods often require multiple images and still struggle with unnatural face shifts, inconsistent hair alignment, or excessive smoothing. To overcome these challenges, we introduce InstaFace, a novel diffusion-based framework that generates realistic images while preserving identity using only a single image. Central to InstaFace is an efficient guidance network that harnesses 3D perspectives by integrating multiple 3DMM-based conditionals without introducing additional trainable parameters. Moreover, to maximize identity retention and preserve background, hair, and other contextual features such as accessories, we introduce a module that utilizes feature embeddings from a facial recognition model and a pre-trained vision-language model. Quantitative evaluations demonstrate that our method outperforms several state-of-the-art approaches in terms of identity preservation, photorealism, and effective control of pose, expression, and lighting.
