Narrating For You: Prompt-guided Audio-visual Narrating Face Generation Employing Multi-entangled Latent Space
Aashish Chandra, Aashutosh A, Abhijit Das
TL;DR
The paper tackles the challenge of real-time, synchronized audiovisual talking-face generation from a static image, a voice profile, and a text prompt. It introduces Narrating For You, a three-phase architecture with a multi-entangled latent space that tightly couples text, audio, and video via dual Transformer encoders and diffusion-based decoders. Through comprehensive experiments across four datasets and extensive ablations, the approach achieves superior audiovisual fidelity and lip-sync accuracy while generalizing across identities and datasets. The work advances practical, identity-preserving multimodal synthesis and highlights the need for ethical guidelines to govern such powerful generative capabilities.
Abstract
We present a novel approach for generating realistic speaking and talking faces by synthesizing a person's voice and facial movements from a static image, a voice profile, and a target text. The model encodes the prompt (the driving text), the driving image, and the individual's voice profile, then combines these encodings and passes them to a multi-entangled latent space, which produces the key-value pairs and queries used by the audio and video generation pipelines. The multi-entangled latent space establishes the person-specific spatiotemporal features shared between the modalities. The entangled features are then passed to each modality's decoder to generate the output audio and video.
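One way to picture how a shared latent space can supply key-value pairs to per-modality queries is plain scaled dot-product cross-attention. The NumPy sketch below is only an illustration under stated assumptions, not the authors' implementation: the feature shapes, the random stand-in "encoders", the shared projection weights, and every name (`cross_attention`, `context`, `audio_queries`, `video_queries`) are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(queries, context, w_q, w_k, w_v):
    """Scaled dot-product attention: queries attend over a shared context."""
    q = queries @ w_q                      # (n_queries, d)
    k = context @ w_k                      # (n_context, d)
    v = context @ w_v                      # (n_context, d)
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)     # each row sums to 1
    return weights @ v, weights

rng = np.random.default_rng(0)
d = 16  # hypothetical feature width

# Random stand-ins for encoder outputs (assumed shapes, not from the paper).
text_feats = rng.normal(size=(12, d))   # 12 text tokens
image_feats = rng.normal(size=(4, d))   # 4 image patches
voice_feats = rng.normal(size=(6, d))   # 6 voice-profile frames

# Combine the three encodings into one shared context (keys and values).
context = np.concatenate([text_feats, image_feats, voice_feats], axis=0)

# Each output modality brings its own queries (hypothetical lengths).
audio_queries = rng.normal(size=(20, d))
video_queries = rng.normal(size=(8, d))

# Projection weights are shared across both streams here purely for brevity.
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

audio_latent, audio_weights = cross_attention(audio_queries, context, w_q, w_k, w_v)
video_latent, video_weights = cross_attention(video_queries, context, w_q, w_k, w_v)

print(audio_latent.shape, video_latent.shape)
```

In this toy setup both the audio and video streams read from the same combined context, which is the sense in which their latents are coupled; in a real system the resulting features would go to learned, modality-specific decoders rather than being printed.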
