DiffFAE: Advancing High-fidelity One-shot Facial Appearance Editing with Space-sensitive Customization and Semantic Preservation
Qilin Wang, Jiangning Zhang, Chengming Xu, Weijian Cao, Ying Tai, Yue Han, Yanhao Ge, Hong Gu, Chengjie Wang, Yanwei Fu
TL;DR
This work tackles high-fidelity one-shot facial appearance editing by addressing fidelity, attribute preservation, and inference efficiency with a one-stage diffusion framework. It introduces Space-sensitive Physical Customization (SPC) to render a query texture from 3DMM-based attributes and Region-responsive Semantic Composition (RSC) to extract disentangled source tokens (including an identity token) that control the diffusion process via AdaIN and cross-attention. The model trains with a latent diffusion objective and a novel attention consistency regularization, achieving state-of-the-art results on VoxCeleb1 in terms of FID and identity preservation while enabling fast, finetuning-free inference and expandable editing capabilities. Overall, DiffFAE offers a practical, scalable solution for high-fidelity, controllable facial appearance editing with strong generalization and editing flexibility, supported by extensive ablations and qualitative results. The approach has potential impact in photography and multimedia applications where precise attribute manipulation and source-feature preservation are critical.
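As a rough illustration of the AdaIN-style conditioning mentioned above, the following sketch shows the generic adaptive instance normalization operation: each feature channel is normalized to zero mean and unit variance, then rescaled and shifted by values derived from a conditioning token. This is the textbook formulation, not necessarily the exact layer used in DiffFAE. The function name `adain` and the assumption that `scale` and `shift` come from a learned projection of a source token are illustrative only.

```python
import math
from statistics import fmean


def adain(features, scale, shift, eps=1e-5):
    """Generic adaptive instance normalization (illustrative sketch).

    features: list of channels, each a flat list of floats (spatial positions).
    scale, shift: one float per channel. Here they are assumed to be predicted
    from a conditioning token (e.g. an identity token) by a learned projection
    that is not shown.
    """
    out = []
    for channel, gamma, beta in zip(features, scale, shift):
        # Per-channel statistics over all spatial positions.
        mean = fmean(channel)
        var = fmean([(x - mean) ** 2 for x in channel])
        std = math.sqrt(var + eps)
        # Normalize, then apply the token-derived scale and shift.
        out.append([gamma * (x - mean) / std + beta for x in channel])
    return out


# Two channels with four spatial positions each (toy values).
features = [[1.0, 2.0, 3.0, 4.0], [0.0, 0.0, 2.0, 2.0]]
styled = adain(features, scale=[2.0, 1.0], shift=[0.5, -1.0])
```

After the call, each output channel has a mean close to its `shift` value and a standard deviation close to its `scale` value, which is how a conditioning token can steer feature statistics without altering the spatial layout.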
Abstract
Facial Appearance Editing (FAE) aims to modify physical attributes of human facial images, such as pose, expression, and lighting, while preserving attributes like identity and background, and is of great importance in photography. Despite great progress in this area, current research generally faces three challenges: low generation fidelity, poor attribute preservation, and inefficient inference. To overcome these challenges, this paper presents DiffFAE, a one-stage and highly efficient diffusion-based framework tailored for high-fidelity FAE. For high-fidelity transfer of query attributes, we adopt Space-sensitive Physical Customization (SPC), which ensures fidelity and generalization ability by utilizing rendered textures derived from a 3D Morphable Model (3DMM). To preserve source attributes, we introduce Region-responsive Semantic Composition (RSC). This module is guided to learn decoupled source-related features, thereby better preserving identity and alleviating artifacts from non-facial attributes such as hair, clothes, and background. We further introduce a consistency regularization for our pipeline to enhance editing controllability by leveraging prior knowledge in the attention matrices of the diffusion model. Extensive experiments demonstrate the superiority of DiffFAE over existing methods, achieving state-of-the-art performance in facial appearance editing.
