JEAN: Joint Expression and Audio-guided NeRF-based Talking Face Generation
Sai Tanmay Reddy Chakkera, Aggelina Chatziagapi, Dimitris Samaras
TL;DR
JEAN is a NeRF-based framework for talking face generation that jointly handles lip synchronization to unseen audio and facial expression transfer while preserving speaker identity. It introduces a self-supervised audio encoder and a transformer-based expression encoder that separate speech-related lip motion from the motion of the rest of the face, and it conditions a dynamic NeRF on these two representations to render expressive, lip-synchronized video. Trained on monocular MEAD videos, JEAN achieves state-of-the-art results in both facial expression transfer and lip synchronization, with ablations confirming the necessity of audio-lip alignment and expression disentanglement. The approach offers a scalable path toward high-fidelity, controllable talking faces and can be extended to other neural rendering pipelines.
Abstract
We introduce a novel method for joint expression and audio-guided talking face generation. Recent approaches either struggle to preserve the speaker identity or fail to produce faithful facial expressions. To address these challenges, we propose a NeRF-based network. Since we train our network on monocular videos without any ground truth, it is essential to learn disentangled representations for audio and expression. We first learn audio features in a self-supervised manner, given utterances from multiple subjects. By incorporating a contrastive learning technique, we ensure that the learned audio features are aligned to the lip motion and disentangled from the muscle motion of the rest of the face. We then devise a transformer-based architecture that learns expression features, capturing long-range facial expressions and disentangling them from the speech-specific mouth movements. Through quantitative and qualitative evaluation, we demonstrate that our method can synthesize high-fidelity talking face videos, achieving state-of-the-art facial expression transfer along with lip synchronization to unseen audio.
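As a rough illustration of the contrastive alignment described above, the following sketch computes an InfoNCE-style loss between paired audio and lip-motion embeddings. The abstract does not say which contrastive objective is used, so the choice of InfoNCE, the function name `info_nce`, the temperature value, and the toy 2-D embeddings are all assumptions made for illustration, not the authors' implementation.

```python
import math

def l2_normalize(v):
    # Scale a vector to unit length so dot products become cosine similarities.
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v] if norm > 0 else list(v)

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def info_nce(audio_feats, lip_feats, temperature=0.1):
    """Mean InfoNCE loss over a batch (hypothetical helper).

    audio_feats[i] and lip_feats[i] are treated as the positive pair;
    every other lip embedding in the batch serves as a negative.
    """
    a = [l2_normalize(v) for v in audio_feats]
    l = [l2_normalize(v) for v in lip_feats]
    total = 0.0
    for i, ai in enumerate(a):
        logits = [dot(ai, lj) / temperature for lj in l]
        # Numerically stable log-sum-exp over all candidates.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(z - m) for z in logits))
        # Negative log-probability of picking the true pair.
        total += log_denom - logits[i]
    return total / len(a)

# Toy embeddings (assumed): two frames with orthogonal features.
aligned = [[1.0, 0.0], [0.0, 1.0]]
shuffled = [[0.0, 1.0], [1.0, 0.0]]

loss_aligned = info_nce(aligned, aligned)    # audio matches its own lip motion
loss_shuffled = info_nce(aligned, shuffled)  # audio paired with the wrong lip motion
print(loss_aligned < loss_shuffled)
```

The loss is low when each audio embedding is most similar to the lip embedding of the same frame and high when the pairing is wrong, which is the property that pushes audio features toward lip motion rather than toward the motion of the rest of the face.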
