DEEPTalk: Dynamic Emotion Embedding for Probabilistic Speech-Driven 3D Face Animation

Jisoo Kim, Jungbin Cho, Joonho Park, Soonmin Hwang, Da Eun Kim, Geon Kim, Youngjae Yu

TL;DR

DEEPTalk tackles the lack of emotional richness in speech-driven 3D face animation by introducing a Dynamic Emotion Embedding (DEE) that learns a probabilistic cross-modal space between speech and facial motion, and a Temporally Hierarchical VQ-VAE (TH-VQVAE) that serves as an expressive, multi-scale motion prior. The system non-autoregressively maps emotional speech to motion codebook indices, guided by an emotion consistency loss that aligns the emotion in speech with the generated expressions. Extensive experiments on MEAD, CREMA-D, RAVDESS, HDTF, and Emo-Vox show state-of-the-art realism (FID/FFD), strong lip-sync (LSE-D/C), and superior emotional expressiveness (Emo-FID), with controllable diversity via alpha and codebook temperatures. The work demonstrates that combining a probabilistic cross-modal emotional space with a hierarchical discrete motion prior yields diverse, natural, and lip-synced emotional talking faces, advancing applications in virtual avatars and interactive agents.
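As a rough illustration of how a codebook temperature can control diversity, the sketch below samples one codebook index per time step from temperature-scaled scores. It is a minimal example under stated assumptions, not the authors' implementation: the function name `sample_code_indices`, the nested-list layout of `logits`, and the use of a plain softmax are all hypothetical choices made for illustration.

```python
import math
import random


def sample_code_indices(logits, temperature=1.0, rng=None):
    """Sample one codebook index per time step from temperature-scaled logits.

    logits: list of time steps, each a list with one score per codebook entry
            (hypothetical layout, assumed for this sketch).
    temperature: values below 1 sharpen the distribution (less diverse),
                 values above 1 flatten it (more diverse).
    """
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = rng or random.Random()
    indices = []
    for step_logits in logits:
        scaled = [value / temperature for value in step_logits]
        peak = max(scaled)  # subtract the max before exp for numerical stability
        weights = [math.exp(value - peak) for value in scaled]
        # random.Random.choices draws proportionally to the unnormalized weights
        indices.append(rng.choices(range(len(weights)), weights=weights, k=1)[0])
    return indices


example_logits = [[0.0, 5.0, 1.0], [3.0, 0.0, 0.0]]
near_greedy = sample_code_indices(example_logits, temperature=1e-3, rng=random.Random(0))
diverse = sample_code_indices(example_logits, temperature=5.0, rng=random.Random(0))
```

With a very small temperature the draw collapses to the highest-scoring entry at each step, while larger temperatures spread probability mass over more codebook entries.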

Abstract

Speech-driven 3D facial animation has garnered lots of attention thanks to its broad range of applications. Despite recent advancements in achieving realistic lip motion, current methods fail to capture the nuanced emotional undertones conveyed through speech and produce monotonous facial motion. These limitations result in blunt and repetitive facial animations, reducing user engagement and hindering their applicability. To address these challenges, we introduce DEEPTalk, a novel approach that generates diverse and emotionally rich 3D facial expressions directly from speech inputs. To achieve this, we first train DEE (Dynamic Emotion Embedding), which employs probabilistic contrastive learning to forge a joint emotion embedding space for both speech and facial motion. This probabilistic framework captures the uncertainty in interpreting emotions from speech and facial motion, enabling the derivation of emotion vectors from its multifaceted space. Moreover, to generate dynamic facial motion, we design TH-VQVAE (Temporally Hierarchical VQ-VAE) as an expressive and robust motion prior overcoming limitations of VAEs and VQ-VAEs. Utilizing these strong priors, we develop DEEPTalk, a talking head generator that non-autoregressively predicts codebook indices to create dynamic facial motion, incorporating a novel emotion consistency loss. Extensive experiments on various datasets demonstrate the effectiveness of our approach in creating diverse, emotionally expressive talking faces that maintain accurate lip-sync. Our project page is available at https://whwjdqls.github.io/deeptalk_website/
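To make the idea of deriving emotion vectors from a probabilistic embedding concrete, the following sketch draws a sample from a diagonal Gaussian described by a mean and a log-variance. It is only a sketch under assumptions: the log-variance parameterization, the function name `sample_emotion_vector`, and the reading of `alpha` as a multiplier on the standard deviation are illustrative choices, not details confirmed by the text above.

```python
import math
import random


def sample_emotion_vector(mean, log_var, alpha=1.0, rng=None):
    """Draw z = mean + alpha * std * eps, with eps ~ N(0, I).

    mean, log_var: per-dimension parameters of a diagonal Gaussian
                   (assumed parameterization for this sketch).
    alpha: hypothetical diversity knob; 0 returns the mean itself,
           larger values spread samples further from it.
    """
    if len(mean) != len(log_var):
        raise ValueError("mean and log_var must have the same length")
    rng = rng or random.Random()
    return [
        m + alpha * math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
        for m, lv in zip(mean, log_var)
    ]


mean = [0.1, -0.3]
log_var = [0.0, -2.0]
deterministic = sample_emotion_vector(mean, log_var, alpha=0.0)
sample_a = sample_emotion_vector(mean, log_var, alpha=1.0, rng=random.Random(7))
sample_b = sample_emotion_vector(mean, log_var, alpha=1.0, rng=random.Random(7))
```

Sampling several vectors for the same utterance and decoding each one is one way such an embedding could yield different but plausible emotional expressions for identical speech.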

Paper Structure

This paper contains 52 sections, 18 equations, 17 figures, 6 tables.

Figures (17)

  • Figure 1: Overview of DEEPTalk. Starting with an emotional speech input (left), we extract probabilistic emotion embeddings (depicted as blobs), and sample from these embeddings to generate diverse emotional facial animations aligned with the input speech (right).
  • Figure 2: Overall Architecture of Our Method. (a) $E_{audio}$ and $E_{exp}$ are trained to predict mean and variance for a joint audio-facial emotion embedding space, DEE. (b) We train TH-VQVAE with separate codebooks, $\mathcal{Z}^b$ and $\mathcal{Z}^t$, for low- and high-frequency motions, respectively. (c) DEEPTalk first extracts face features, predicts top and bottom codebook indices, and uses frozen TH-VQVAE decoders to decode the quantized motion features. To ensure emotion alignment between the input audio and the predicted facial expressions, we introduce an emotion consistency loss $L_{emo}$ that utilizes DEE.
  • Figure 3: Qualitative results on MEAD test set. Each row displays the predicted facial motions for each utterance and corresponding emotion (left) generated by baseline models. Lip motion deviations from the ground truth are highlighted in red, while incorrect or neutral emotional expressions are indicated in purple. EMOTE, being conditioned on emotion labels, exhibits a high degree of emotional expressiveness. However, this conditioning sometimes results in exaggerated expressions, highlighted in pink in the enlarged images. In contrast, DEEPTalk generates natural emotional faces while maintaining accurate lip sync.
  • Figure 4: Embeddings are clustered by emotion categories.
  • Figure 5: User Study Results. Our method is preferred over most methods on emotional alignment and lip synchronization.
  • ...and 12 more figures