Media2Face: Co-speech Facial Animation Generation With Multi-Modality Guidance

Qingcheng Zhao; Pengyu Long; Qixuan Zhang; Dafei Qin; Han Liang; Longwen Zhang; Yingliang Zhang; Jingyi Yu; Lan Xu

TL;DR

This work tackles the scarcity of high-quality 4D facial data and the need for flexible conditioning in co-speech 3D facial animation. It introduces the Generalized Neural Parametric Facial Asset (GNPFA), a neural latent space that disentangles expression from identity using Range of Motion (RoM) data, and builds the Media2Face diffusion model in this latent space, guided by multi-modal inputs ($A$ from audio, $P$ from CLIP for text/image prompts). Media2Face achieves high-fidelity lip-sync, nuanced expressions, and rhythmically aligned head motion, with the large, diverse M2F-D dataset enabling robust learning; it also supports keyframe editing and CLIP-guided style control. The approach reports quantitative and qualitative gains over prior methods and targets real-time, multi-modal, stylized co-speech facial animation for virtual agents and related AI systems.

Abstract

The synthesis of 3D facial animations from speech has garnered considerable attention. Due to the scarcity of high-quality 4D facial data and of abundant, well-annotated multi-modality labels, previous methods often suffer from limited realism and a lack of flexible conditioning. We address this challenge through a trilogy. We first introduce Generalized Neural Parametric Facial Asset (GNPFA), an efficient variational auto-encoder mapping facial geometry and images to a highly generalized expression latent space, decoupling expressions and identities. Then, we utilize GNPFA to extract high-quality expressions and accurate head poses from a large array of videos. This yields the M2F-D dataset, a large, diverse, and scan-level co-speech 3D facial animation dataset with well-annotated emotional and style labels. Finally, we propose Media2Face, a diffusion model in GNPFA latent space for co-speech facial animation generation, accepting rich multi-modality guidance from audio, text, and image. Extensive experiments demonstrate that our model not only achieves high fidelity in facial animation synthesis but also broadens the scope of expressiveness and style adaptability in 3D facial animation.

Paper Structure (26 sections, 10 equations, 8 figures, 2 tables)

Introduction
Related Work
3D Facial Animation Representations
Conditional Facial Animation Synthesis
Reshape Facial Animation Data
Expression Latent Space Learning
Training data
Geometry VAE
Image Facial Expression Extraction
Latent-based Facial Animation Dataset
Media2Face Methods
Facial Animation Latent Diffusion Models
Training
Inference
Overlapped batching denoising
...and 11 more sections

Figures (8)

Figure 1: Given the speech signal and multi-modal conditions (Left), our system generates personalized and stylized co-speech facial animation and head poses (Middle, Right).
Figure 2: GNPFA pipeline. (Left) We train a geometry VAE to learn a latent space of expression and head pose, disentangling expression from identity. (Right) Two vision encoders are trained to extract expression latent codes and head poses from RGB images, which enables us to capture a wide array of 4D data.
Figure 3: Architecture of Media2Face. Our model takes audio features and a CLIP latent code as conditions and denoises the noised sequence of expression latent codes together with head poses, i.e., the head motion code. The conditions are randomly masked and subjected to cross-attention with the noisy head motion code. At inference, we sample head motion codes with DDIM. We feed the expression latent code to the GNPFA decoder to extract the expression geometry, which is combined with a model template to produce facial animation enhanced by head pose parameters.
Figure 4: Application showcase. We can fine-tune the generated facial animation (Row 2) by (1) extracting key-frame expression latent codes through our expression encoder (Row 3), and (2) providing per-frame style prompts through CLIP (Row 4; Left: happy, Right: sad). The intensity and range of control can be adjusted using diffusion in-betweening techniques.
Figure 5: User study result. Note how our method has demonstrated overwhelming superiority in the singing cases, showcasing the model's ability to generate rich emotions and rhythmic head movements.
...and 3 more figures
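
The Figure 3 caption notes that the audio and CLIP conditions are randomly masked during training. The following is a minimal NumPy sketch of what such masking could look like, not the authors' implementation. The function names, the drop probability `p_drop`, the use of zeros as the "masked" value, and the guidance formula in `guided_prediction` (the standard classifier-free guidance combination, which the page does not confirm the paper uses) are all assumptions made for illustration.

```python
import numpy as np


def mask_conditions(audio_feat, clip_feat, p_drop, rng):
    """Zero out each condition per sample with probability p_drop.

    Hypothetical helper: the shapes and the zero "null" condition are
    assumptions, not details taken from the paper.
    audio_feat: array of shape (batch, frames, audio_dim)
    clip_feat:  array of shape (batch, clip_dim)
    """
    batch = audio_feat.shape[0]
    # One independent keep/drop decision per sample and per condition.
    keep_audio = rng.random(batch) >= p_drop
    keep_clip = rng.random(batch) >= p_drop
    # Reshape the flags so they broadcast over the feature axes.
    audio_mask = keep_audio.reshape(batch, *([1] * (audio_feat.ndim - 1)))
    clip_mask = keep_clip.reshape(batch, *([1] * (clip_feat.ndim - 1)))
    return audio_feat * audio_mask, clip_feat * clip_mask, keep_audio, keep_clip


def guided_prediction(cond_pred, uncond_pred, scale):
    """Standard classifier-free guidance mix (assumed, not confirmed here)."""
    return uncond_pred + scale * (cond_pred - uncond_pred)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = np.ones((4, 10, 8))   # 4 clips, 10 frames, 8-dim audio features
    clip = np.ones((4, 16))       # 4 clips, 16-dim prompt embedding
    masked_audio, masked_clip, keep_a, keep_c = mask_conditions(
        audio, clip, p_drop=0.5, rng=rng
    )
    print("audio kept per sample:", keep_a)
    print("prompt kept per sample:", keep_c)
```

Training a denoiser on inputs where each condition is sometimes absent is what lets a single network produce both conditioned and unconditioned predictions, which a combination like `guided_prediction` can then blend at sampling time.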
