MagicFace: High-Fidelity Facial Expression Editing with Action-Unit Control

Mengting Wei; Tuomas Varanka; Xingxun Jiang; Huai-Qian Khor; Guoying Zhao

TL;DR

MagicFace addresses the problem of high-fidelity facial expression editing by conditioning a diffusion model on action-unit (AU) variations while preserving identity, pose, and background. It introduces an ID encoder that merges identity features via self-attention and an Attribute Controller to separate background/pose from facial edits, enabling precise and continuous AU-driven editing across arbitrary identities. The approach uses AU variations defined as $\mathbf{c}_{AU} = \mathbf{c}_{ID} - \mathbf{c}_{tgt}$ and employs AU dropout with classifier-free guidance, trained on 30K Aff-Wild identity pairs with an AU-edit loss $\mathcal{L}_{AUEdit}$, achieving strong AU accuracy and robust identity preservation, even in out-of-domain scenarios. The work demonstrates practical, user-friendly facial expression editing with potential applications in avatars and digital media, while acknowledging societal implications and the need for safeguards against misuse.
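The AU conditioning summarized above can be illustrated with a minimal sketch. It follows the sign convention quoted in this summary ($\mathbf{c}_{AU} = \mathbf{c}_{ID} - \mathbf{c}_{tgt}$) and the standard classifier-free guidance combination. The function names, the all-zero "null" condition, and the dropout probability are assumptions made for illustration; they are not taken from the MagicFace code.

```python
import random


def au_variation(c_id, c_tgt):
    """Element-wise AU variation, using the sign convention quoted in the TL;DR."""
    if len(c_id) != len(c_tgt):
        raise ValueError("AU vectors must have the same length")
    return [a - b for a, b in zip(c_id, c_tgt)]


def au_dropout(c_au, p_drop, rng):
    """With probability p_drop, replace the whole AU condition with a null vector.

    Using zeros as the null condition is an assumption of this sketch.
    """
    if rng.random() < p_drop:
        return [0.0] * len(c_au)
    return list(c_au)


def guided_prediction(eps_uncond, eps_cond, scale):
    """Standard classifier-free guidance: move from the unconditional
    prediction toward the conditional one by a factor of `scale`."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


if __name__ == "__main__":
    rng = random.Random(0)
    c_au = au_variation([1.0, 3.0], [0.5, 1.0])   # hypothetical AU intensities
    print(c_au)
    print(au_dropout(c_au, p_drop=0.1, rng=rng))  # p_drop is illustrative
    print(guided_prediction([0.0, 1.0], [1.0, 1.0], scale=2.0))
```

Dropping the condition during training is what lets a single network produce both the conditional and unconditional predictions that the guidance step combines at inference time.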

Abstract

We address the problem of facial expression editing by controlling the relative variation of facial action units (AUs) for the same person. This enables us to edit this specific person's expression in a fine-grained, continuous and interpretable manner, while preserving their identity, pose, background and detailed facial attributes. Key to our model, which we dub MagicFace, is a diffusion model conditioned on AU variations and an ID encoder that preserves facial details with high consistency. Specifically, to keep facial details consistent with the input identity, we leverage the power of pretrained Stable-Diffusion models and design an ID encoder to merge appearance features through self-attention. To keep background and pose consistent, we introduce an efficient Attribute Controller that explicitly informs the model of the current background and pose of the target. By injecting AU variations into a denoising UNet, our model can animate arbitrary identities with various AU combinations, yielding superior results in high-fidelity expression editing compared to other facial expression editing works. Code is publicly available at https://github.com/weimengting/MagicFace.
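As a rough illustration of merging appearance features through self-attention, the sketch below lets each denoising-UNet token attend over the concatenation of UNet tokens and identity tokens, and returns outputs only for the UNet positions. This concatenate-then-attend layout is a common choice in reference-network designs; whether MagicFace uses exactly this form is an assumption. The learned query/key/value projections are omitted for brevity, and all names are hypothetical.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def merged_self_attention(unet_tokens, id_tokens):
    """Self-attention in which UNet tokens also attend to identity tokens.

    Keys and values are the UNet tokens followed by the identity tokens;
    queries are the UNet tokens only, so the output has one row per UNet token.
    """
    context = unet_tokens + id_tokens
    dim = len(unet_tokens[0])
    merged = []
    for query in unet_tokens:
        # Scaled dot-product scores against every context token.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in context
        ]
        weights = softmax(scores)
        # Weighted sum of context tokens (values equal keys in this sketch).
        merged.append(
            [sum(w * tok[d] for w, tok in zip(weights, context)) for d in range(dim)]
        )
    return merged


if __name__ == "__main__":
    unet = [[1.0, 0.0], [0.0, 1.0]]      # toy denoising-UNet features
    identity = [[0.5, 0.5]]              # toy ID-encoder features
    print(merged_self_attention(unet, identity))
```

Because the output keeps only the UNet positions, the identity features influence the result without changing the shape of the UNet feature map.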

Paper Structure (15 sections, 4 equations, 10 figures, 7 tables)

Introduction
Related Works
Facial Action Units
Facial Expression Editing
Method
Preliminary
Architecture
AU dropout
Training Strategy
Experiments
Implementations
Results
Ablation Study
Impact Statement
Discussion and Conclusion

Figures (10)

Figure 1: MagicFace takes in AU changes relative to the input portrait and edits the portrait to exhibit different expressions. The edited image respects the AU condition and preserves identity, pose, background, and other facial details.
Figure 2: A display showcasing various action units and their corresponding intensity scales. Only a set of commonly used AUs is displayed here. For a complete collection of AUs with descriptions, see ozelfaces.
Figure 3: Overview of MagicFace. During training, a pair of images with the same identity but different poses, backgrounds, and expressions is used as the identity image and the target, respectively. AU variations are computed by an estimator and then sent into the denoising UNet as an AU condition. The pose and background of the target are parsed independently into an image condition, processed by an Attribute Controller, and then input to the denoising UNet. The ID encoder takes in the encoded identity image to be edited toward the target AUs, and the features in each of its transformer blocks are merged into the corresponding ones of the denoising UNet via self-attention. During inference, the conditional image is parsed from the identity image.
Figure 4: Qualitative comparison for continuous facial expression editing. Our method preserves fine facial details while allowing flexible, fine-grained control over expression intensity. Please zoom in for a more detailed view.
Figure 5: Qualitative comparison with representative methods on discrete facial expression editing. The leftmost column shows the input images used as the editing source, and the remaining columns display the edited results for each target expression.
...and 5 more figures
