EmoTalkingGaussian: Continuous Emotion-conditioned Talking Head Synthesis

Junuk Cha, Seongro Yoon, Valeriya Strizhkova, Francois Bremond, Seungryul Baek

TL;DR

EmoTalkingGaussian addresses the limited emotional expressiveness of state-of-the-art 3D Gaussian talking heads by conditioning on continuous valence/arousal values while keeping the lips synchronized with the input audio. The method introduces a lip-aligned emotional face generator, a three-branch EmoTalkingGaussian architecture (inside-mouth, face, emotion) that leverages tri-plane hash encoders, and self-supervised sync training on a TTS-generated audio dataset with a SyncNet-based loss. Synthetic data augmentation, normal-map-based refinements, and a comprehensive suite of training losses improve image quality, emotion representation, and lip-sync accuracy, outperforming prior methods on PSNR, SSIM, LPIPS, lip synchronization metrics, and emotion-consistency metrics. The approach enables robust, continuous emotion control for talking heads without requiring per-subject data collection, with potential for real-time, expressive avatars. The authors also acknowledge biases and misuse risks and propose ethical considerations.

Abstract

3D Gaussian splatting-based talking head synthesis has recently gained attention for its ability to render high-fidelity images at real-time inference speed. However, since it is typically trained on only a short video that lacks diversity in facial emotions, the resulting talking heads struggle to represent a wide range of emotions. To address this issue, we propose a lip-aligned emotional face generator and leverage it to train our EmoTalkingGaussian model. The model can manipulate facial emotions conditioned on continuous emotion values (i.e., valence and arousal) while keeping lip movements synchronized with the input audio. Additionally, to achieve accurate lip synchronization for in-the-wild audio, we introduce a self-supervised learning method that leverages a text-to-speech network and a visual-audio synchronization network. We evaluate EmoTalkingGaussian on publicly available videos and obtain better results than state-of-the-art methods in terms of image quality (measured in PSNR, SSIM, LPIPS), emotion expression (measured in V-RMSE, A-RMSE, V-SA, A-SA, Emotion Accuracy), and lip synchronization (measured in LMD, Sync-E, Sync-C).
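
The emotion metrics named above are not defined in this abstract. As a rough illustration only, the sketch below assumes that V-RMSE and A-RMSE are root-mean-square errors between predicted and target valence (or arousal) values, and that V-SA and A-SA are sign-agreement rates, as is common in valence/arousal estimation. The function names `rmse` and `sign_agreement` are hypothetical, and the paper's exact definitions may differ.

```python
import math


def rmse(pred, target):
    """Root-mean-square error between two equal-length sequences.

    Assumed here to correspond to V-RMSE / A-RMSE when applied to
    valence / arousal values respectively.
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - t) ** 2 for p, t in zip(pred, target)]
    return math.sqrt(sum(squared) / len(squared))


def sign_agreement(pred, target):
    """Fraction of samples whose predicted and target values share a sign.

    Assumed here to correspond to V-SA / A-SA. Zero is treated as
    non-negative, which is an arbitrary choice made for this sketch.
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    agree = sum(1 for p, t in zip(pred, target) if (p >= 0) == (t >= 0))
    return agree / len(pred)


# Toy valence values in [-1, 1]; these are made-up numbers, not results.
target_valence = [0.5, 0.5, -0.5, -0.5]
pred_valence = [0.5, -0.5, -0.5, 0.5]

v_rmse = rmse(pred_valence, target_valence)
v_sa = sign_agreement(pred_valence, target_valence)
```

Under these assumed definitions, lower RMSE and higher sign agreement both indicate that the rendered emotion is closer to the requested valence/arousal.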

Paper Structure

This paper contains 38 sections, 31 equations, 18 figures, 5 tables.

Figures (18)

  • Figure 1: The state-of-the-art 3D talking head synthesis method, TalkingGaussian [li2024talkinggaussian], manipulates expressions based on action units [ekman1978facial]; however, its ability to express diverse emotions is limited, and its image quality degrades when it represents an unseen emotional expression from the emotion source image [pexels]. Our method can reflect diverse expressions and emotions based on action units as well as valence/arousal [russell1980circumplex], and it renders the talking head with a lip shape well aligned to the input audio (/ni/ and <mute>), as shown in the left panel. The right panel demonstrates our method's capability to convey continuous emotions through valence/arousal adjustments while keeping the lips synchronized to the audio. The "ce" in "nice," which the speaker is pronouncing, is highlighted in red.
  • Figure 2: (a) shows the source image; (b) and (c) show images for the 'happy&surprise' emotion (valence of 0.8, arousal of 0.6), generated by EmoStyle [azari2024emostyle] and by our lip-aligned emotional face generator, respectively.
  • Figure 3: Overview of EmoTalkingGaussian, which is composed of three branches. First, the inside-mouth branch estimates the position offsets of 3D Gaussians based on audio features $\mathbf{a}$. Second, the face branch estimates the position, scaling factor, and quaternion offsets based on audio features $\mathbf{a}$ and action units $\mathbf{u}$. The inside-mouth branch and face branch are inherited from TalkingGaussian [li2024talkinggaussian], as indicated by the dashed rectangle. Finally, the emotion branch estimates the position, scaling factor, and quaternion offsets based on emotion inputs $\mathbf{e}$ (valence/arousal). We render the mouth region and face region $\hat{I}$ along the black arrow. Then, we render the mouth region and emotional face region $\hat{I}^\text{E}$ along the yellow arrow. We apply an RGB loss and a normal loss, along with an audio and lip synchronization loss, to improve visual fidelity and overall alignment.
  • Figure 4: Qualitative comparisons with other baselines, including ER-NeRF [li2023efficient], GaussianTalker [cho2024gaussiantalker], and TalkingGaussian [li2024talkinggaussian]. Each word is displayed with the part being spoken highlighted in red. The last sample shows the phonetic transcription. 'V' and 'A' stand for valence and arousal, and the emotion labels indicate the emotion that the 'V' and 'A' values represent. Emotional inconsistencies and lip mismatches are highlighted with blue and brown dashed boxes, respectively.
  • Figure S1: Overview of the lip-aligned emotional face generator. While EmoStyle [azari2024emostyle] cannot produce lip-aligned emotional facial images, our generator creates such images by aligning lips based on lip heatmaps. $\boxplus$ denotes vector summation.
  • ...and 13 more figures
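
The Figure 3 caption describes branches that predict position, scaling factor, and quaternion offsets for 3D Gaussians. A minimal sketch of how such offsets might be applied is given below. It assumes that offsets from the face branch and the emotion branch are simply added to the canonical attributes and that the quaternion is renormalized afterwards; the caption does not state how the offsets are combined, so this is an assumption. The names `Gaussian`, `Offsets`, and `deform` are hypothetical, and the offset values stand in for network outputs.

```python
import math
from dataclasses import dataclass


@dataclass
class Gaussian:
    position: tuple  # (x, y, z)
    scale: tuple     # per-axis scaling factor
    rotation: tuple  # quaternion (w, x, y, z)


@dataclass
class Offsets:
    # Defaults mean "no deformation"; a real branch would predict these.
    d_position: tuple = (0.0, 0.0, 0.0)
    d_scale: tuple = (0.0, 0.0, 0.0)
    d_rotation: tuple = (0.0, 0.0, 0.0, 0.0)


def _add(a, b):
    return tuple(x + y for x, y in zip(a, b))


def _normalize(q):
    norm = math.sqrt(sum(c * c for c in q))
    if norm == 0.0:
        raise ValueError("cannot normalize a zero quaternion")
    return tuple(c / norm for c in q)


def deform(gaussian, *offsets):
    """Apply each set of offsets additively (assumed combination rule)."""
    pos, scale, rot = gaussian.position, gaussian.scale, gaussian.rotation
    for o in offsets:
        pos = _add(pos, o.d_position)
        scale = _add(scale, o.d_scale)
        rot = _add(rot, o.d_rotation)
    # Renormalize so the result is still a valid unit rotation quaternion.
    return Gaussian(pos, scale, _normalize(rot))


canonical = Gaussian((0.0, 0.0, 0.0), (1.0, 1.0, 1.0), (1.0, 0.0, 0.0, 0.0))
face_offsets = Offsets(d_position=(0.1, 0.0, 0.0))      # from audio + action units
emotion_offsets = Offsets(d_position=(0.0, 0.2, 0.0),   # from valence/arousal
                          d_scale=(0.5, 0.0, 0.0))

neutral = deform(canonical, face_offsets)                     # black-arrow path
emotional = deform(canonical, face_offsets, emotion_offsets)  # yellow-arrow path
```

In this sketch the two calls to `deform` mirror the two rendering paths in the caption: the face region $\hat{I}$ uses only the face-branch offsets, while the emotional face region $\hat{I}^\text{E}$ additionally applies the emotion-branch offsets.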