JEAN: Joint Expression and Audio-guided NeRF-based Talking Face Generation

Sai Tanmay Reddy Chakkera, Aggelina Chatziagapi, Dimitris Samaras

TL;DR

JEAN is a NeRF-based framework for talking face generation that jointly handles lip synchronization to unseen audio and facial expression transfer while preserving the speaker's identity. It introduces a self-supervised audio encoder and a transformer-based expression encoder that separate speech-related lip motion from the motion of the rest of the face, and it conditions a dynamic NeRF on these representations to render expressive, lip-synchronized video. Trained on monocular MEAD videos, JEAN achieves state-of-the-art results in both expression transfer fidelity and lip synchronization, with ablations confirming the necessity of audio-lip alignment and expression disentanglement. The approach offers a scalable path toward high-fidelity, controllable talking faces and can be extended to other neural rendering pipelines.

Abstract

We introduce a novel method for joint expression and audio-guided talking face generation. Recent approaches either struggle to preserve the speaker identity or fail to produce faithful facial expressions. To address these challenges, we propose a NeRF-based network. Since we train our network on monocular videos without any ground truth, it is essential to learn disentangled representations for audio and expression. We first learn audio features in a self-supervised manner, given utterances from multiple subjects. By incorporating a contrastive learning technique, we ensure that the learned audio features are aligned to the lip motion and disentangled from the muscle motion of the rest of the face. We then devise a transformer-based architecture that learns expression features, capturing long-range facial expressions and disentangling them from the speech-specific mouth movements. Through quantitative and qualitative evaluation, we demonstrate that our method can synthesize high-fidelity talking face videos, achieving state-of-the-art facial expression transfer along with lip synchronization to unseen audio.
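
The abstract says only that "a contrastive learning technique" aligns audio features with lip motion; it does not give the loss. As a rough illustration of the idea, the sketch below uses a symmetric InfoNCE-style objective, which is a common choice for this kind of alignment but is an assumption here, not the authors' confirmed formulation. The function name `info_nce`, the temperature of 0.1, the feature dimension of 16, and the synthetic data are all hypothetical.

```python
import math
import random


def _normalize(v):
    """Scale a vector to unit L2 norm so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def _cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(z - m) for z in row))
        total += log_sum - row[i]
    return total / len(logits)


def info_nce(audio_feats, lip_feats, temperature=0.1):
    """Symmetric InfoNCE-style loss over paired audio / lip-motion features.

    Assumption: row i of audio_feats and row i of lip_feats come from the
    same video frame (a positive pair); all other rows act as negatives.
    """
    a = [_normalize(v) for v in audio_feats]
    l = [_normalize(v) for v in lip_feats]
    # logits[i][j] = cosine similarity of audio i and lip j, scaled by temperature
    logits = [[sum(x * y for x, y in zip(ai, lj)) / temperature for lj in l]
              for ai in a]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the audio-to-lip and lip-to-audio directions
    return 0.5 * (_cross_entropy_diag(logits) + _cross_entropy_diag(logits_t))


# Synthetic stand-ins for encoder outputs (hypothetical data, batch of 8)
random.seed(0)
lip = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]
aligned_audio = [[x + 0.01 * random.gauss(0, 1) for x in row] for row in lip]
random_audio = [[random.gauss(0, 1) for _ in range(16)] for _ in range(8)]

loss_aligned = info_nce(aligned_audio, lip)
loss_random = info_nce(random_audio, lip)
print("aligned:", loss_aligned, "random:", loss_random)
```

In this toy setup, audio features that track the lip features yield a lower loss than unrelated ones, which is the behavior such an objective rewards. How the method additionally keeps audio features disentangled from the rest of the face is not covered by this sketch.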

Paper Structure (35 sections, 6 equations, 11 figures, 2 tables)

Figures (11)

  • Figure 1: We introduce JEAN, a novel NeRF-based method that simultaneously combines lip-syncing to a target audio with facial expression transfer to generate talking faces.
  • Figure 2: (a) illustrates an overview of JEAN, a novel method for joint expression and audio-guided NeRF-based talking face generation. (b) and (c) illustrate our proposed self-supervised learning of our audio representation. Specifically, (b) demonstrates the self-supervised learning of our landmark autoencoder that disentangles lip motion from the motion of the rest of the face. Then, in (c), our audio encoder $AE$ is trained using a contrastive learning regime, in order to align audio features to lip motion.
  • Figure 3: Expression Transformer. We propose an expression transformer encoder that learns to disentangle facial expressions from speech-specific lip motion. We extract emotion features and disentangle them into expression content and speech-specific lip motion content.
  • Figure 4: Talking face generation guided by target expression and audio sources (1st column). We compare with state-of-the-art methods for expression and audio-driven talking face generation (EAMM [jieamm], PD-FGC [wang2022pdfgc]), categorical emotion-based talking face generation (EAT [EAT_gen]), as well as the audio-only AD-NeRF [guo2021adnerf] and the expression-only NeRFace [nerface]. Our method outperforms all these methods, transferring the expression and audio inputs with higher fidelity, while preserving the target identity.
  • Figure 5: Additional analysis showing that the expression encoder disentangles features that are semantically grounded and well-behaved. Interpolating features between different emotional expressions yields semantically meaningful expressions.
  • ...and 6 more figures