Towards Consistent and Controllable Image Synthesis for Face Editing

Mengting Wei, Tuomas Varanka, Yante Li, Xingxun Jiang, Huai-Qian Khor, Guoying Zhao

TL;DR

This work tackles controllable, identity-preserving face editing with diffusion models by fusing 3D Morphable Models (3DMMs) with Stable Diffusion. It introduces a Spatial Attribute Provider to decouple background, pose, lighting, and expression, and a FaceFusion module to inject high-fidelity identity features into the SD UNet, all within a full-model fine-tuning framework. The approach achieves superior identity preservation and realism across a range of edits and identities, including out-of-domain styles, and demonstrates robust generalization with a dedicated training strategy and ablations. The results underscore the potential of combining interpretable 3D-based controls with powerful diffusion priors for photorealistic, consistent face editing in practical applications.

Abstract

Face editing methods, essential for tasks like virtual avatars, digital human synthesis and identity preservation, have traditionally been built upon GAN-based techniques, while recent focus has shifted to diffusion-based models due to their success in image reconstruction. However, diffusion models still face challenges in controlling specific attributes and in keeping the remaining attributes consistent, especially the identity characteristics. To address these issues and facilitate more convenient editing of face images, we propose a novel approach that leverages the power of Stable-Diffusion (SD) models and crude 3D face models to control the lighting, facial expression and head pose of a portrait photo. We observe that this task essentially involves combining the target background, the identity and the face attributes to be edited. We strive to sufficiently disentangle the control of these factors to enable consistent face editing. Specifically, our method, named RigFace, contains: 1) A Spatial Attribute Encoder that provides precise and decoupled conditions of background, pose, expression and lighting; 2) A high-consistency FaceFusion method that transfers identity features from the Identity Encoder to the denoising UNet of a pre-trained SD model; 3) An Attribute Rigger that injects those conditions into the denoising UNet. Our model achieves comparable or even superior performance in both identity preservation and photorealism compared to existing face editing models.

Paper Structure

This paper contains 15 sections, 2 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Consistent and controllable face editing results given identity images. Our approach is capable of editing arbitrary identities with new facial expression, pose and lighting, generating clear and stable results while keeping the attributes that are not meant to change consistent.
  • Figure 2: Overview pipeline of RigFace. The Spatial Attribute Provider adapts the foreground mask and predicts the 3D rendering as well as the expression coefficients, offering decoupled and clearer guidance for controlled generation. The mask and rendering are encoded using the Attribute Rigger and fused with noise, after which the Denoising UNet performs the denoising process for generation. The expression coefficients are directly encoded in the Denoising UNet. FaceFusion extracts detailed features from the source image (identity) through the Identity Encoder and uses them in Self-Attention. The Identity Encoder and Denoising UNet are completely fine-tuned to better adapt the prior knowledge of SD to our task.
  • Figure 3: Illustration of how the background is parsed based on the source and target images in the training data. The left panel shows the result of attaching only the masked-out facial region from the source image onto the background of the target image. The right panel demonstrates attaching the source facial region using a combination of the source and target masks to better align with the target background. The red box highlights regions of target identity leakage, where residual facial information from the target image remains visible.
  • Figure 4: Qualitative comparison between RigFace and other baselines on real face images. The slashed cells indicate unsupported editing scenarios. Zoom in to view image details.
  • Figure 5: Qualitative comparison between RigFace and other baselines across multiple styles. The slashed cells indicate unsupported editing scenarios. From top to bottom, the styles are oil painting, cyberpunk, cartoon and comic. Zoom in to view image details.
  • ...and 6 more figures
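
The background parsing described in the Figure 3 caption can be illustrated with a small NumPy sketch. It is only a toy illustration, not the authors' implementation: it assumes that the "combination" of the source and target masks is a pixel-wise union, and the helper names `mask_background` and `leaked_target_pixels` are hypothetical. It shows why blanking the target image with the source mask alone can leave residual target-face pixels visible, while the combined mask removes them.

```python
import numpy as np


def mask_background(target_img, face_mask, fill=0):
    """Return a copy of the target image with the masked face region blanked out."""
    out = target_img.copy()
    out[face_mask] = fill  # a 2D boolean mask selects whole RGB pixels
    return out


def leaked_target_pixels(face_mask, target_mask):
    """Count target-face pixels that the chosen mask fails to cover."""
    return int(np.count_nonzero(target_mask & ~face_mask))


# Toy 4x4 RGB "target image" and two partially overlapping face masks.
target_img = np.full((4, 4, 3), 200, dtype=np.uint8)
source_mask = np.zeros((4, 4), dtype=bool)
source_mask[1:3, 0:2] = True
target_mask = np.zeros((4, 4), dtype=bool)
target_mask[1:3, 1:3] = True

# Left panel of Figure 3 (assumed): only the source mask is used.
source_only_bg = mask_background(target_img, source_mask)

# Right panel of Figure 3 (assumed): union of source and target masks.
combined_mask = source_mask | target_mask
combined_bg = mask_background(target_img, combined_mask)

print(leaked_target_pixels(source_mask, target_mask))    # 2
print(leaked_target_pixels(combined_mask, target_mask))  # 0
```

In this toy case the two leaked pixels are the part of the target face that the source mask does not cover, which corresponds to the kind of region the caption marks with a red box.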