DiffFAE: Advancing High-fidelity One-shot Facial Appearance Editing with Space-sensitive Customization and Semantic Preservation

Qilin Wang, Jiangning Zhang, Chengming Xu, Weijian Cao, Ying Tai, Yue Han, Yanhao Ge, Hong Gu, Chengjie Wang, Yanwei Fu

TL;DR

This work tackles high-fidelity one-shot facial appearance editing by addressing fidelity, attribute preservation, and inference efficiency with a one-stage diffusion framework. It introduces Space-sensitive Physical Customization (SPC) to render a query texture from 3DMM-based attributes and Region-responsive Semantic Composition (RSC) to extract disentangled source tokens (including an identity token) that control the diffusion process via AdaIN and cross-attention. The model trains with a latent diffusion objective and a novel attention consistency regularization, achieving state-of-the-art results on VoxCeleb1 in terms of FID and identity preservation while enabling fast, finetuning-free inference and expandable editing capabilities. Overall, DiffFAE offers a practical, scalable solution for high-fidelity, controllable facial appearance editing with strong generalization and editing flexibility, supported by extensive ablations and qualitative results. The approach has potential impact in photography and multimedia applications where precise attribute manipulation and source-feature preservation are critical.

Abstract

Facial Appearance Editing (FAE) aims to modify physical attributes of human facial images, such as pose, expression, and lighting, while preserving attributes like identity and background, and is of great importance in photography. Despite great progress in this area, current research generally faces three challenges: low generation fidelity, poor attribute preservation, and inefficient inference. To overcome these challenges, this paper presents DiffFAE, a one-stage and highly efficient diffusion-based framework tailored for high-fidelity FAE. For high-fidelity transfer of query attributes, we adopt Space-sensitive Physical Customization (SPC), which ensures fidelity and generalization ability by utilizing rendered texture derived from a 3D Morphable Model (3DMM). To preserve source attributes, we introduce Region-responsive Semantic Composition (RSC). This module is guided to learn decoupled source-related features, thereby better preserving identity and alleviating artifacts from non-facial attributes such as hair, clothes, and background. We further introduce a consistency regularization for our pipeline to enhance editing controllability by leveraging prior knowledge in the attention matrices of the diffusion model. Extensive experiments demonstrate the superiority of DiffFAE over existing methods, achieving state-of-the-art performance in facial appearance editing.

Paper Structure (21 sections, 7 equations, 15 figures, 6 tables)

This paper contains 21 sections, 7 equations, 15 figures, 6 tables.

Figures (15)

  • Figure 1: Top: Our DiffFAE produces high-fidelity compositional editing of three physical facial appearance attributes, i.e., pose, expression, and lighting, while better preserving unedited attributes. Bottom: DiffFAE has strong disentangled editing capabilities and can be easily extended to editing other attributes, such as background and clothes.
  • Figure 2: Overview of the proposed DiffFAE framework, which consists of: 1) Space-sensitive Physical Customization (SPC) takes the query image $\bm{{I}}_{Q}$ as input and passes it through DECA to extract physical coefficients, i.e., albedo $\boldsymbol{\alpha}$, shape $\boldsymbol{\beta}$, camera $\boldsymbol{c}$, pose $\boldsymbol{\rho}$, expression $\boldsymbol{\psi}$, and Spherical Harmonics lighting $\boldsymbol{l}$. These parameters are rendered by the renderer $\boldsymbol{\mathcal{R}}$ to obtain the facial texture $\bm{{I}}_{R}$, which is then compressed by the pretrained VQ-VAE encoder $\bm{{\phi}}^{E}$ into its latent representation $\bm{{f}}_{r}$. The concatenation of $\bm{{f}}_{r}$ and the noisy latent code $\bm{{z}}^{T}$ serves as the physical-attribute conditioning. 2) Region-responsive Semantic Composition (RSC) includes a region-responsive encoder $\boldsymbol{{\varphi}}^{E}$ and an N-iteration Slot-Attention (SA) module that extracts four decoupled feature vectors $\bm{{F}}_{N_{S}}^{N}=\{\bm{{f}}_{1}^{N}, \bm{{f}}_{2}^{N}, \bm{{f}}_{3}^{N}, \bm{{f}}_{4}^{N}\}$ from a randomly initialized $\bm{{F}}_{N_{S}}^{0}$; these represent four different regions of the source image $\bm{{I}}_{S}$. Furthermore, an identity extractor $\bm{\Theta}^E$ (ArcFace) encodes $\bm{{I}}_{S}$ in parallel into an embedding $\bm{{f}}_{id}$, which is used together with $\bm{{F}}_{N_{S}}^{N}$ to modulate the denoising U-Net $\bm{{\epsilon}}_{\theta}$ via AdaIN and cross-attention, respectively. Finally, the decoder $\bm{\phi}^{D}$ transforms the output of $\bm{\epsilon}_{\theta}$ into the generated image $\bm{{I}}_{O}$. Notably, $\bm{{I}}_{Q}$ and $\bm{{I}}_{S}$ share the same identity during training. During inference, the identity-related physical attributes $\bm{{\alpha}}$/$\boldsymbol{\beta}$ and the region attribute $\bm{{f}}_{4}^{N}$ come from $\bm{{I}}_{S}$ (marked in red), while the other attributes can be customized from any image.
  • Figure 3: Qualitative comparison between our method and current one-shot SOTA methods. Note that Hou et al., StyleHEAT, and DPE cannot handle certain types of attributes, hence the corresponding results are not presented.
  • Figure 4: Comparison between models with different source image processors.
  • Figure 5: Qualitative comparison between models trained with different numbers of semantic tokens.
  • ...and 10 more figures
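
The Figure 2 caption states that the identity embedding $\bm{{f}}_{id}$ modulates the denoising U-Net via AdaIN, while the region tokens $\bm{{F}}_{N_{S}}^{N}$ act through cross-attention. The following is a minimal NumPy sketch of those two conditioning mechanisms in isolation, not the authors' implementation. The projection matrices (`w_scale`, `w_shift`, `w_q`, `w_k`, `w_v`), the `(1 + scale)` parameterization, the single attention head, the residual connection, and all dimensions are assumptions chosen for illustration.

```python
import numpy as np


def adain(features, identity_embedding, w_scale, w_shift, eps=1e-5):
    """Adaptive instance normalization of a (C, H, W) feature map.

    Each channel is normalized over its spatial positions, then rescaled and
    shifted with per-channel parameters predicted from the identity embedding.
    w_scale and w_shift are hypothetical (C, D) projection matrices.
    """
    mean = features.mean(axis=(1, 2), keepdims=True)
    std = features.std(axis=(1, 2), keepdims=True)
    normalized = (features - mean) / (std + eps)
    scale = (w_scale @ identity_embedding)[:, None, None]  # (C, 1, 1)
    shift = (w_shift @ identity_embedding)[:, None, None]  # (C, 1, 1)
    return (1.0 + scale) * normalized + shift


def cross_attention(features, tokens, w_q, w_k, w_v):
    """Single-head cross-attention: spatial positions query the region tokens.

    features: (C, H, W); tokens: (N, T). w_q is (C, d), w_k is (T, d) and
    w_v is (T, C); all three are hypothetical projections.
    Returns the updated feature map and the (H*W, N) attention matrix.
    """
    c, h, w = features.shape
    x = features.reshape(c, h * w).T                 # (HW, C)
    q = x @ w_q                                      # (HW, d)
    k = tokens @ w_k                                 # (N, d)
    v = tokens @ w_v                                 # (N, C)
    logits = q @ k.T / np.sqrt(q.shape[-1])          # (HW, N)
    logits = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(logits)
    attn = attn / attn.sum(axis=-1, keepdims=True)   # softmax over the tokens
    out = x + attn @ v                               # residual update (assumed)
    return out.T.reshape(c, h, w), attn


# Toy dimensions (assumptions): 8 channels, 4x4 map, 16-d identity embedding,
# 4 region tokens of width 12, attention width 6.
rng = np.random.default_rng(0)
C, H, W, D, N, T, d = 8, 4, 4, 16, 4, 12, 6
feats = rng.normal(size=(C, H, W))
f_id = rng.normal(size=D)
region_tokens = rng.normal(size=(N, T))

modulated = adain(
    feats, f_id,
    0.1 * rng.normal(size=(C, D)),
    0.1 * rng.normal(size=(C, D)),
)
out, attn = cross_attention(
    modulated, region_tokens,
    rng.normal(size=(C, d)),
    rng.normal(size=(T, d)),
    rng.normal(size=(T, C)),
)
```

Each row of `attn` is a distribution over the four region tokens for one spatial position. Attention matrices of this kind are what the abstract's consistency regularization draws on, although the regularizer itself is not sketched here.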