EmoTalkingGaussian: Continuous Emotion-conditioned Talking Head Synthesis
Junuk Cha, Seongro Yoon, Valeriya Strizhkova, Francois Bremond, Seungryul Baek
TL;DR
EmoTalkingGaussian addresses the limited emotional expressiveness of state-of-the-art 3D Gaussian talking heads by conditioning on continuous valence and arousal values while keeping lip movements synchronized with the input audio. The method introduces a lip-aligned emotional face generator, a three-branch EmoTalkingGaussian architecture (inside-mouth, face, emotion) built on tri-plane hash encoders, and self-supervised sync training that uses a TTS-generated audio dataset with a SyncNet-based loss. Synthetic data augmentation, normal-map-based refinements, and a comprehensive set of training losses improve image quality, emotion representation, and lip-sync accuracy, and the method outperforms prior work on PSNR, SSIM, LPIPS, lip synchronization metrics, and emotion-consistency metrics. The approach enables robust, continuous emotion control for talking heads without requiring per-subject data collection, with potential for real-time, expressive avatars; the authors also acknowledge bias and misuse risks and discuss ethical considerations.
Abstract
3D Gaussian splatting-based talking head synthesis has recently gained attention for its ability to render high-fidelity images at real-time inference speed. However, because such a model is typically trained on only a short video that lacks diversity in facial emotions, the resulting talking heads struggle to represent a wide range of emotions. To address this issue, we propose a lip-aligned emotional face generator and leverage it to train our EmoTalkingGaussian model. The model can manipulate facial emotions conditioned on continuous emotion values (i.e., valence and arousal) while keeping lip movements synchronized with the input audio. Additionally, to achieve accurate lip synchronization for in-the-wild audio, we introduce a self-supervised learning method that leverages a text-to-speech network and a visual-audio synchronization network. We evaluate EmoTalkingGaussian on publicly available videos and obtain better results than state-of-the-art methods in terms of image quality (measured by PSNR, SSIM, and LPIPS), emotion expression (measured by V-RMSE, A-RMSE, V-SA, A-SA, and Emotion Accuracy), and lip synchronization (measured by LMD, Sync-E, and Sync-C).
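As a rough illustration of the simpler metrics named above, the sketch below computes PSNR on flattened pixel values and RMSE on per-frame valence (or arousal) predictions. It also includes a sign-agreement score, on the assumption that the "SA" in V-SA and A-SA stands for sign agreement; the abstract does not spell this out. The function names, the [0, 1] pixel range, and the treatment of zero as its own sign are all illustrative choices, not the authors' evaluation code.

```python
import math


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences.

    Assumes pixel values lie in [0, max_value] (an assumption of this sketch).
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def rmse(pred, target):
    """Root mean square error, e.g. between predicted and target valence values."""
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    return math.sqrt(mse)


def _sign(x):
    # Returns -1, 0, or 1; zero is treated as its own sign in this sketch.
    return (x > 0) - (x < 0)


def sign_agreement(pred, target):
    """Fraction of samples whose predicted and target values share a sign.

    Assumed reading of the "SA" metrics; not confirmed by the abstract.
    """
    if len(pred) != len(target) or len(pred) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    matches = sum(1 for p, t in zip(pred, target) if _sign(p) == _sign(t))
    return matches / len(pred)
```

With these definitions, higher PSNR and sign agreement indicate better results, while lower RMSE is better. SSIM, LPIPS, and the SyncNet-based scores depend on windowed statistics or pretrained networks and are not sketched here.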
