FaceTalk: Audio-Driven Motion Diffusion for Neural Parametric Head Models

Shivangi Aneja, Justus Thies, Angela Dai, Matthias Nießner

TL;DR

This work proposes FaceTalk, a latent diffusion model that operates in the expression space of neural parametric head models (NPHM) to synthesize realistic, audio-driven head motion sequences. The generated motion is plausible and, coupled with the NPHM shape space, yields high-fidelity head animation.

Abstract

We introduce FaceTalk, a novel generative approach designed for synthesizing high-fidelity 3D motion sequences of talking human heads from an input audio signal. To capture the expressive, detailed nature of human heads, including hair, ears, and finer-scale eye movements, we propose to couple the speech signal with the latent space of neural parametric head models (NPHM) to create high-fidelity, temporally coherent motion sequences. We propose a new latent diffusion model for this task, operating in the expression space of neural parametric head models, to synthesize audio-driven realistic head sequences. In the absence of a dataset pairing NPHM expressions with audio, we optimize for these correspondences to produce a dataset of temporally optimized NPHM expressions fit to audio-video recordings of people talking. To the best of our knowledge, this is the first work to propose a generative approach for realistic and high-quality motion synthesis of volumetric human heads, representing a significant advancement in the field of audio-driven 3D animation. Notably, our approach stands out in its ability to generate plausible motion sequences that can produce high-fidelity head animation coupled with the NPHM shape space. Our experimental results substantiate the effectiveness of FaceTalk, which consistently achieves superior and visually natural motion, encompassing diverse facial expressions and styles, and outperforms existing methods by 75% in a perceptual user study.

Paper Structure

This paper contains 38 sections, 33 equations, 15 figures, and 5 tables.

Figures (15)

  • Figure 1: Given an input speech signal, we propose a diffusion-based approach to synthesize high-quality and temporally consistent 3D motion sequences of high-fidelity human heads as neural parametric head models. Our method can generate a diverse set of expressions (including wrinkles and eye blinks) and the generated mouth motion is temporally synchronized with the given audio signal.
  • Figure 2: Pipeline Overview. FaceTalk uses a frozen Wav2Vec 2.0 model [baevski2020wav2vec] to extract audio embeddings from a speech signal. The diffusion timestamp is embedded using a timestamp embedder. The expression decoder employs a multi-head transformer decoder [vaswani2023attention] with FiLM [perez2017film] layers, interleaved between Self-Attention, Cross-Attention, and FeedForward layers, to incorporate the diffusion timestamp. During training, the model is trained to denoise the noisy expression sequences from timestamp $t$. At inference, FaceTalk denoises the Gaussian noise sequence $\{ \boldsymbol{\theta}_{exp} \}_{T}^{1:N} \sim \mathcal{N}(0,\boldsymbol{I})$ iteratively until $t=0$, yielding the estimated final sequence $\{ \hat{\boldsymbol{\theta}}_{exp} \}^{1:N}$. These are then input to the frozen NPHM model, utilizing facial smoothing, and mesh sequences are extracted using Marching Cubes (MC) [marching_cubes].
  • Figure 3: Given the point cloud sequence $\{ \mathcal{P}_{i} \}_{i=1}^{N}$ extracted from multi-view sequences of the NeRSemble dataset [kirschstein2023nersemble] (bottom right), whose points also act as query points, we leverage the pretrained Expression MLP $\{\mathcal{E} \}$ to extract the expression deformations $\{ \boldsymbol{\delta}^{i}_{exp} \}^{1:N}$ and add them back to the input points to get the deformed points $\{ \mathcal{P}'_{i} \}_{i=1}^{N}$. These points are then fed to the Identity MLP $\{\mathcal{I} \}$, which outputs the SDF. The expression codes $\{ \boldsymbol{\theta}_{exp}^{i} \}^{1:N}$ are optimized using the overall loss $\mathcal{L}_{total}$. Note that both the fixed identity code $\boldsymbol{\theta}_{id}$ and the learnable expression codes $\{ \boldsymbol{\theta}_{exp}^{i} \}^{1:N}$ are fed to both $\{\mathcal{I} \}$ and $\{\mathcal{E} \}$. Once optimized, the meshes are extracted with Marching Cubes [marching_cubes].
  • Figure 4: Qualitative comparison for audio-driven face animation. Our approach maintains high fidelity while demonstrating rich mouth and nasolabial movements. In particular, we demonstrate more accurate lip articulation, precisely synchronized to phonetic movements.
  • Figure 5: Left. Expressions generated by our method can easily be applied to diverse identities with complex geometry. Right. Given a speech signal, our method can further generate diversity in expression for the same identity. Note the difference in speaking style (intensity of mouth opening) as well as eyeblinks/frowning in the upper face area.
  • ...and 10 more figures
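
The inference procedure described in the Figure 2 caption (start from a Gaussian noise sequence and denoise it step by step until $t=0$, with FiLM injecting the diffusion step) can be illustrated with a small NumPy sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the linear noise schedule, the number of steps, the sinusoidal step embedding, the choice to have the network predict the clean sequence, and the toy denoiser that stands in for the trained transformer decoder. The names `film`, `timestep_embedding`, `make_toy_denoiser`, and `sample` are hypothetical.

```python
import numpy as np

def film(x, gamma, beta):
    """FiLM modulation: feature-wise scale (gamma) and shift (beta)."""
    return gamma * x + beta

def timestep_embedding(t, dim):
    """Sinusoidal embedding of an integer diffusion step (assumed choice)."""
    half = dim // 2
    freqs = np.exp(-np.log(10000.0) * np.arange(half) / half)
    angles = t * freqs
    return np.concatenate([np.sin(angles), np.cos(angles)])

def make_toy_denoiser(code_dim, emb_dim=16, seed=1):
    """Hypothetical stand-in for the trained expression decoder.

    It ignores the noisy input and returns audio features modulated by
    FiLM parameters derived from the step embedding.
    """
    rng = np.random.default_rng(seed)
    w_gamma = 0.1 * rng.standard_normal((emb_dim, code_dim))
    w_beta = 0.1 * rng.standard_normal((emb_dim, code_dim))

    def denoiser(x_t, t, audio_feats):
        emb = timestep_embedding(t, emb_dim)
        gamma = 1.0 + emb @ w_gamma   # per-feature scale
        beta = emb @ w_beta           # per-feature shift
        return film(audio_feats, gamma, beta)

    return denoiser

def sample(denoiser, audio_feats, n_frames, code_dim, n_steps=50, seed=0):
    """DDPM-style ancestral sampling with a clean-sequence predictor (assumed)."""
    rng = np.random.default_rng(seed)
    betas = np.linspace(1e-4, 0.02, n_steps)   # assumed linear schedule
    alphas = 1.0 - betas
    alpha_bar = np.cumprod(alphas)

    # Start from pure Gaussian noise of shape (frames, expression-code dim).
    x = rng.standard_normal((n_frames, code_dim))
    for t in range(n_steps - 1, -1, -1):
        x0_hat = denoiser(x, t, audio_feats)
        ab_prev = alpha_bar[t - 1] if t > 0 else 1.0
        # Mean and variance of the Gaussian posterior q(x_{t-1} | x_t, x0_hat).
        coef_x0 = np.sqrt(ab_prev) * betas[t] / (1.0 - alpha_bar[t])
        coef_xt = np.sqrt(alphas[t]) * (1.0 - ab_prev) / (1.0 - alpha_bar[t])
        var = betas[t] * (1.0 - ab_prev) / (1.0 - alpha_bar[t])
        noise = rng.standard_normal(x.shape)
        x = coef_x0 * x0_hat + coef_xt * x + np.sqrt(var) * noise
    return x

# Usage: 30 frames of 8-dimensional codes, with random stand-in audio features.
audio = np.random.default_rng(2).standard_normal((30, 8))
denoiser = make_toy_denoiser(code_dim=8)
codes = sample(denoiser, audio, n_frames=30, code_dim=8)
print(codes.shape)  # (30, 8)
```

At the last step the posterior variance is zero, so the loop returns the denoiser's final prediction without added noise. In the pipeline the caption describes, the resulting expression codes would then be passed to the frozen NPHM model and meshed with Marching Cubes; that stage is not sketched here.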