GMTalker: Gaussian Mixture-based Audio-Driven Emotional Talking Video Portraits

Yibo Xia, Lizhen Wang, Xiang Deng, Xiaoyan Luo, Yunhong Wang, Yebin Liu

TL;DR

GMTalker tackles audio-driven emotional talking head synthesis by introducing a Gaussian Mixture Expression Generator to model a continuous, disentangled emotion latent space, paired with a Transformer-based MoG mapper and a decoder for accurate lip-sync and emotion expression. It further mitigates the mean-motion issue with a Normalizing Flow-based Motion Generator pretrained on VoxCeleb2 to produce diverse, natural head poses, blinks, and gaze, and adds an Emotion Mapping Network for personalized stylistic control via a StyleUNet-based head generator. The framework achieves superior emotion accuracy, visual quality, and motion diversity across MEAD, CREMA-D, and LSP benchmarks, with smooth emotion interpolation demonstrated by new metrics such as Emotion Perceptual Path Length and Emotion Perceptual Distance Variance. These contributions enable precise, continuous emotion manipulation and personalized speaking styles, offering a practical pathway for high-fidelity, controllable talking portraits in education, entertainment, and virtual human applications.

Abstract

Synthesizing high-fidelity, emotion-controllable talking video portraits with audio-lip sync, vivid expressions, realistic head poses, and eye blinks has been an important and challenging task in recent years. Most existing methods struggle to achieve personalized and precise emotion control, smooth transitions between emotion states, and diverse motion generation. To tackle these challenges, we present GMTalker, a Gaussian mixture-based framework for generating emotional talking portraits. Specifically, we propose a Gaussian mixture-based expression generator that constructs a continuous and disentangled latent space, enabling more flexible emotion manipulation. Furthermore, we introduce a normalizing flow-based motion generator, pretrained on a large dataset covering a wide range of motions, to generate diverse head poses, blinks, and eyeball movements. Finally, we propose a personalized emotion-guided head generator with an emotion mapping network that synthesizes high-fidelity and faithful emotional video portraits. Both quantitative and qualitative experiments demonstrate that our method outperforms previous methods in image quality, photo-realism, emotion accuracy, and motion diversity.
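
The idea of manipulating emotion inside a continuous Gaussian mixture latent space can be illustrated with a toy sketch. This is not the authors' implementation: the emotion list, latent size, component means, shared standard deviation, and the choice to blend component means by the emotion weights are all assumptions made for illustration only.

```python
import numpy as np

# Hypothetical label set and latent size (not taken from the paper).
EMOTIONS = ["neutral", "happy", "angry", "sad"]
LATENT_DIM = 8

rng = np.random.default_rng(0)
# One Gaussian component per emotion; the means are random placeholders.
MEANS = rng.normal(size=(len(EMOTIONS), LATENT_DIM))
STD = 0.1  # assumed shared, isotropic standard deviation


def sample_latent(weights, rng):
    """Draw z from a Gaussian whose mean blends the component means by `weights`."""
    weights = np.asarray(weights, dtype=float)
    weights = weights / weights.sum()  # normalize so the weights sum to 1
    mean = weights @ MEANS
    return mean + STD * rng.normal(size=LATENT_DIM)


def interpolate_weights(src, dst, steps):
    """Linear path between the one-hot weights of two emotion labels."""
    eye = np.eye(len(EMOTIONS))
    w_src = eye[EMOTIONS.index(src)]
    w_dst = eye[EMOTIONS.index(dst)]
    return [(1.0 - t) * w_src + t * w_dst for t in np.linspace(0.0, 1.0, steps)]


# Walk from "happy" to "angry" and sample one latent code per step.
path = interpolate_weights("happy", "angry", steps=5)
latents = np.stack([sample_latent(w, rng) for w in path])
print(latents.shape)
```

Because the weights change linearly, the blended mean moves along a straight line between the two component means, which is one simple way to obtain gradual transitions between emotion states. A real system would learn the component parameters and decode each latent code into expression coefficients.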

Paper Structure

This paper contains 35 sections, 18 equations, 12 figures, and 7 tables.

Figures (12)

  • Figure 1: GMTalker. Given the driving speech and emotion label, our method can generate high-fidelity and faithful emotional talking video portraits with diverse motions. Emotions can be freely manipulated within our continuous and disentangled Gaussian mixture distributed latent space. Additionally, our method can also predict motions from the input speech, including head poses, eye blinks, and gaze.
  • Figure 2: Pipeline of GMTalker. Our framework consists of three parts: (a) In Section \ref{GMVAE}, given the input speech and emotion weights label, we propose GMEG to generate 3DMM expression coefficients by sampling from a Gaussian mixture latent space. (b) In Section \ref{FVAE}, we introduce NFMG to predict motion coefficients from the audio, including poses, eye blinks, and gaze. (c) In Section \ref{StyleUNet}, we render these coefficients into 3DMM renderings for the target person and then use an emotion-guided head generator with EMN to synthesize photo-realistic video portraits with personalized style.
  • Figure 3: The training process of our proposed GMEG and NFMG. (a) We autoregressively reconstruct facial expression coefficients $\hat{\beta}_{1:t}$ from input audio $a_{1:t}$ and emotion label $e$ by optimizing four losses: $\mathcal{L}_{rec}$, $\mathcal{L}_{cond}$, $\mathcal{L}_{w}$, and $\mathcal{L}_{emo}$. (b) Our NFMG generates diverse motions $\hat{\rho}_{1:t}$ from audio, including head poses, eye blinks, and gaze, by learning a Transformer normalizing flow-based VAE.
  • Figure 4: Qualitative comparison of emotional talking video portraits on two cases from the MEAD test dataset. The emotion categories of the videos are happy (left) and angry (right). The bottom row shows ground-truth frames. Since EAMM [ji2022eamm] and EAT are one-shot methods, we choose the same reference image used in EAT [gan2023efficient] to generate target videos for them.
  • Figure 5: Qualitative results of the emotion interpolation comparison. For PD-FGC [wang2023progressive] and Styletalk [ma2023styletalk], we manipulate expressions between the source emotion video and the target emotion video. For EAT [gan2023efficient], we achieve this by transitioning between the source emotion label and the target emotion label.
  • ...and 7 more figures
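
The Figure 3(b) caption describes generating diverse motions from the same audio with a normalizing flow. The general mechanism can be sketched with a single conditional affine flow layer: different latent draws map to different outputs for one conditioning vector, and the mapping stays exactly invertible. This is a toy stand-in, not the Transformer flow-based VAE from the paper; the feature sizes, the weight matrices `scale_w` and `shift_w`, and the random "audio feature" are hypothetical.

```python
import numpy as np


def flow_forward(z, cond, scale_w, shift_w):
    """Map latent z to an output x with an elementwise affine transform conditioned on cond."""
    log_scale = np.tanh(cond @ scale_w)  # bounded log-scale keeps the map stable
    shift = cond @ shift_w
    return z * np.exp(log_scale) + shift


def flow_inverse(x, cond, scale_w, shift_w):
    """Exact inverse of flow_forward for the same conditioning vector."""
    log_scale = np.tanh(cond @ scale_w)
    shift = cond @ shift_w
    return (x - shift) * np.exp(-log_scale)


def log_prob(x, cond, scale_w, shift_w):
    """Log-density of x via change of variables with a standard normal base."""
    z = flow_inverse(x, cond, scale_w, shift_w)
    log_scale = np.tanh(cond @ scale_w)
    base = -0.5 * np.sum(z**2 + np.log(2.0 * np.pi))
    return base - np.sum(log_scale)  # subtract log|det dx/dz|


rng = np.random.default_rng(0)
COND_DIM, MOTION_DIM = 4, 6                      # hypothetical sizes
cond = rng.normal(size=COND_DIM)                 # stands in for one audio feature
scale_w = 0.5 * rng.normal(size=(COND_DIM, MOTION_DIM))
shift_w = rng.normal(size=(COND_DIM, MOTION_DIM))

# Three latent draws give three different outputs for the same condition.
zs = rng.normal(size=(3, MOTION_DIM))
samples = np.stack([flow_forward(z, cond, scale_w, shift_w) for z in zs])
print(samples.shape)
```

Sampling several latent codes for a fixed condition is what yields variety instead of a single averaged output, and the exact inverse is what allows such a model to be trained by maximizing `log_prob` on observed data.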