InstaFace: Identity-Preserving Facial Editing with Single Image Inference

MD Wahiduzzaman Khan, Mingshan Jia, Xiaolin Zhang, En Yu, Caifeng Shan, Kaska Musial-Gabrys

TL;DR

InstaFace tackles single-image identity-preserving facial editing under large pose, expression, and lighting changes by marrying a diffusion-based generator with a 3DMM-conditioned latent guidance system and a dual-embedding Identity Preserver. The 3D Fusion Controller ingests 3DMM conditionals into latent space without extra trainable parameters, while the Identity Preserver combines CLIP and face-recognition embeddings to preserve identity and contextual details. The approach is trained in two stages, leveraging DECA-derived conditionals and a CLIP+FR projection module to achieve state-of-the-art identity retention and photorealism, outperforming several baselines. This work enables realistic, identity-consistent facial edits from a single image, with potential impact on digital avatars, AR/VR, and personalized content creation.

Abstract

Facial appearance editing is crucial for digital avatars, AR/VR, and personalized content creation, driving realistic user experiences. However, preserving identity with generative models is challenging, especially in scenarios with limited data availability. Traditional methods often require multiple images and still struggle with unnatural face shifts, inconsistent hair alignment, or excessive smoothing effects. To overcome these challenges, we introduce a novel diffusion-based framework, InstaFace, to generate realistic images while preserving identity using only a single image. Central to InstaFace, we introduce an efficient guidance network that harnesses 3D perspectives by integrating multiple 3DMM-based conditionals without introducing additional trainable parameters. Moreover, to ensure maximum identity retention as well as preservation of background, hair, and other contextual features like accessories, we introduce a novel module that utilizes feature embeddings from a facial recognition model and a pre-trained vision-language model. Quantitative evaluations demonstrate that our method outperforms several state-of-the-art approaches in terms of identity preservation, photorealism, and effective control of pose, expression, and lighting.

Paper Structure

This paper contains 12 sections, 6 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Prior methods (left image of each pair) exhibit various types of issues, such as (a) unnatural facial deformations, (b) identity shifts in features like hair, eye color, and face shape, (c) inconsistencies in clothing and hair styling, and (d) artifacts or distortions in the background and accessories. In contrast, our approach (right image of each pair) effectively resolves these issues, preserving natural facial geometry, consistent identity, and coherent styling across all elements. Reference images (Ref.) are provided for (b) and (d).
  • Figure 2: InstaFace leverages a single image to drive complex facial reenactments with conditional controls, including changes in pose, expression, and lighting. Our method ensures that the generated images retain the subject's identity, background, and fine-grained details while accurately reflecting the specified conditions.
  • Figure 3: Overview of InstaFace Architecture: (a) Conditional maps generated by the pre-trained DECA Model are processed by the 3D Fusion Controller to produce latent conditionals, which are then utilized by the Guidance Network to guide the diffusion model; (b) Semantic and identity features are extracted and concatenated to provide conditions for the diffusion process; (c) The Diffusion Network synthesizes the final image, guided by both the Guidance Network and the concatenated embeddings.
  • Figure 4: Baseline comparisons with DECA, HeadNeRF, GIF, DiffusionRig, CapHuman, and VOODOO3D. Our method retains identity better while generating realistic facial images under varying conditions. DiffusionRig is marked with (*) because it requires per-subject fine-tuning on a set of 20 images. VOODOO3D does not support lighting edits.
  • Figure 5: Evaluation of pose and expression rigging quality during single-image fine-tuning. The input images are expected to follow the corresponding target image's (a) pose and (b) expression.
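
The conditioning step described in Figure 3(b), where semantic and identity features are concatenated before being passed to the diffusion network, can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration and is not taken from the paper: the toy dimensions, the random weights, the bias-free linear projections, the choice to concatenate along the token axis, and the helper names `linear`, `random_matrix`, and `build_condition`. A real system would obtain the two inputs from a pre-trained vision-language model and a face-recognition model rather than from random numbers.

```python
import random


def linear(x, weight):
    """Apply a bias-free linear map: x is (n, d_in), weight is (d_in, d_out)."""
    return [
        [sum(xi * w for xi, w in zip(row, col)) for col in zip(*weight)]
        for row in x
    ]


def random_matrix(rows, cols, rng):
    """Build a (rows, cols) matrix of small Gaussian values (stand-in weights)."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def build_condition(clip_tokens, face_embedding, w_clip, w_face):
    """Project both embeddings to a shared width, then concatenate the tokens."""
    semantic = linear(clip_tokens, w_clip)        # (n_tokens, d_model)
    identity = linear([face_embedding], w_face)   # (1, d_model)
    return semantic + identity                    # (n_tokens + 1, d_model)


rng = random.Random(0)

# Toy sizes chosen for readability; they are hypothetical, not the paper's.
N_TOKENS, D_CLIP, D_FACE, D_MODEL = 4, 8, 6, 5

# Stand-ins for the vision-language tokens and the face-recognition vector.
clip_tokens = random_matrix(N_TOKENS, D_CLIP, rng)
face_embedding = [rng.gauss(0.0, 1.0) for _ in range(D_FACE)]

# Stand-ins for the learned projection weights.
w_clip = random_matrix(D_CLIP, D_MODEL, rng)
w_face = random_matrix(D_FACE, D_MODEL, rng)

condition = build_condition(clip_tokens, face_embedding, w_clip, w_face)
print(len(condition), len(condition[0]))
```

Projecting both embeddings to one shared width before joining them is what allows a single conditioning sequence to carry contextual detail (background, hair, accessories) alongside identity information. Whether the actual model joins them along the token axis or the feature axis is not stated in the text above.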