Supervising 3D Talking Head Avatars with Analysis-by-Audio-Synthesis

Radek Daněček, Carolin Schmitt, Senya Polikovsky, Michael J. Black

TL;DR

THUNDER is a diffusion-based, stochastic 3D talking head framework that achieves state-of-the-art lip-sync while preserving expressive diversity. It does so by introducing a mesh-to-speech (M2S) model and a novel analysis-by-audio-synthesis supervision loop. The M2S model regresses audio representations from facial animation, providing a cross-modal feedback signal that guides the diffusion model toward lip-synced yet varied expressions. The approach improves lip-sync across datasets and baselines, and the M2S model can also improve other talking-head models when integrated as a supervisory signal. Overall, THUNDER demonstrates that audio-consistent, expressive 3D facial animation is achievable with a two-stage pipeline (first train the M2S model, then use it to supervise the diffusion model) and self-supervised cross-modal feedback, with implications for realistic avatars and related 3D reconstruction tasks.

Abstract

In order to be widely applicable, speech-driven 3D head avatars must articulate their lips in accordance with speech, while also conveying the appropriate emotions with dynamically changing facial expressions. The key problem is that deterministic models produce high-quality lip-sync but without rich expressions, whereas stochastic models generate diverse expressions but with lower lip-sync quality. To get the best of both, we seek a stochastic model with accurate lip-sync. To that end, we develop a new approach based on the following observation: if a method generates realistic 3D lip motions, it should be possible to infer the spoken audio from the lip motion. The inferred speech should match the original input audio, and erroneous predictions create a novel supervision signal for training 3D talking head avatars with accurate lip-sync. To demonstrate this effect, we propose THUNDER (Talking Heads Under Neural Differentiable Elocution Reconstruction), a 3D talking head avatar framework that introduces a novel supervision mechanism via differentiable sound production. First, we train a novel mesh-to-speech model that regresses audio from facial animation. Then, we incorporate this model into a diffusion-based talking avatar framework. During training, the mesh-to-speech model takes the generated animation and produces a sound that is compared to the input speech, creating a differentiable analysis-by-audio-synthesis supervision loop. Our extensive qualitative and quantitative experiments demonstrate that THUNDER significantly improves the quality of the lip-sync of talking head avatars while still allowing for generation of diverse, high-quality, expressive facial animations. The code and models will be available at https://thunder.is.tue.mpg.de/
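The supervision loop described in the abstract can be illustrated with a small numerical sketch. This is not the authors' implementation: the diffusion model is replaced by a deterministic linear map, the pretrained mesh-to-speech model by a fixed random linear map, and every name and size below (`w_gen`, `w_m2s`, `audio_consistency_loss`, `train`, the feature dimensions) is a hypothetical stand-in. It only shows the core idea that the error between re-synthesized and input audio features is backpropagated through a frozen mesh-to-speech model into the animation generator.

```python
import numpy as np


def audio_consistency_loss(w_gen, w_m2s, audio):
    """MSE between re-synthesized and input audio features, plus its gradient w.r.t. w_gen."""
    mesh = audio @ w_gen        # toy "generated animation", shape (frames, d_mesh)
    audio_hat = mesh @ w_m2s    # audio features inferred from the animation
    residual = audio_hat - audio
    loss = np.mean(residual ** 2)
    # Backpropagate through the frozen mesh-to-speech map; only w_gen gets a gradient.
    grad_audio_hat = 2.0 * residual / residual.size
    grad_mesh = grad_audio_hat @ w_m2s.T
    grad_w_gen = audio.T @ grad_mesh
    return loss, grad_w_gen


def train(steps=200, lr=0.05, seed=0):
    """Run plain gradient descent on the toy generator and return the loss per step."""
    rng = np.random.default_rng(seed)
    n_frames, d_audio, d_mesh = 50, 8, 6  # hypothetical toy sizes
    audio = rng.normal(size=(n_frames, d_audio))  # stand-in for input speech features
    # Frozen stand-in for a pretrained mesh-to-speech model (assumption: linear).
    w_m2s = rng.normal(size=(d_mesh, d_audio)) / np.sqrt(d_mesh)
    w_m2s_before = w_m2s.copy()
    # Trainable stand-in for the animation generator (assumption: linear, deterministic).
    w_gen = 0.01 * rng.normal(size=(d_audio, d_mesh))
    losses = []
    for _ in range(steps):
        loss, grad = audio_consistency_loss(w_gen, w_m2s, audio)
        losses.append(loss)
        w_gen -= lr * grad  # update the generator only
    # The mesh-to-speech weights are never updated during training.
    assert np.array_equal(w_m2s, w_m2s_before)
    return losses


if __name__ == "__main__":
    history = train()
    print(f"first loss: {history[0]:.4f}, last loss: {history[-1]:.4f}")
```

In this toy setting the mesh dimension is smaller than the audio dimension, so the loss cannot reach zero. That loosely mirrors the fact that lip motion does not determine the audio completely, yet the mismatch still provides a usable training signal.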

Paper Structure

This paper contains 40 sections, 11 equations, 18 figures, 8 tables.

Figures (18)

  • Figure 1: Mesh-to-speech architecture. It takes a sequence of mouth shapes as input, along with a speaker embedding feature, to produce the output speech units and spectrograms. These are used to compute a loss (top) and to produce the reconstructed audio (bottom) using a pretrained vocoder.
  • Figure 2: THUNDER architecture. The upper part of the figure (green) depicts the architecture of the diffusion model and the lower part (yellow) illustrates the application of the audio consistency loss. Gray boxes indicate the input to the system. Trainable components are highlighted in orange and frozen ones in blue.
  • Figure 3: Perceptual study of THUNDER. We compare THUNDER-F (-F for frozen backbone) with methods having a trainable audio encoder (THUNDER-T w/o m2s), methods having frozen encoders (FlameFormer-F, THUNDER-F w/o m2s), and GT. The participants prefer THUNDER-F's lip-sync over that of the other models. Remarkably, the participants also show a slight preference for THUNDER-F over GT, which suggests that the application of M2S helps THUNDER saturate the quality of GT.
  • Figure 4: Qualitative comparison on THUNDERSET. This figure shows the comparison between baselines, our model and GT for selected utterances. Note that DiffPoseTalk was trained on TFHP. Supplemental PDF and video contain more qualitative comparisons.
  • Figure 5: Media2Face conditioning images. These images were used as the conditions for the Media2Face* disentanglement experiment in Tab. \ref{tab:disentanglement}. The images were selected from the RAVDESS test set. Top row from left to right: happy, angry, sad, disgusted. Bottom row: fearful, surprised, calm.
  • ...and 13 more figures