Narrating For You: Prompt-guided Audio-visual Narrating Face Generation Employing Multi-entangled Latent Space

Aashish Chandra, Aashutosh A, Abhijit Das

TL;DR

The paper tackles the challenge of real-time, synchronized audiovisual talking-face generation from a static image, a voice profile, and a text prompt. It introduces Narrating For You, a three-phase architecture with a multi-entangled latent space that tightly couples text, audio, and video via dual Transformer encoders and diffusion-based decoders. Through comprehensive experiments across four datasets and extensive ablations, the approach achieves superior audiovisual fidelity and lip-sync accuracy while generalizing across identities and datasets. The work advances practical, identity-preserving multimodal synthesis and highlights the need for ethical guidelines to govern such powerful generative capabilities.

Abstract

We present a novel approach for generating realistic talking faces by synthesizing a person's voice and facial movements from a static image, a voice profile, and a target text. The model encodes the prompt (driving text), the driving image, and the individual's voice profile, then combines the encodings and passes them to a multi-entangled latent space, which produces the key-value pairs and queries for the audio and video generation pipelines. The multi-entangled latent space establishes the spatiotemporal, person-specific features shared between the modalities. The entangled features are then passed to each modality's decoder to generate the output audio and video.

Paper Structure

This paper contains 13 sections, 2 equations, 4 figures, and 7 tables.

Figures (4)

  • Figure 1: Our Network Architecture: Text Prompt-guided joint audio-visual learning representations using dual stream Transformer Encoders and Denoising Diffusion model. The model architecture can be divided into three phases -- namely Encoding Phase, Multi-Latent Entanglement, and Decoding Phase. As an output, an audio-visual animation is generated from a single source image, reference audio, and a short text prompt.
  • Figure 2: Visual comparison on VoxCeleb, in the order: Ground Truth, Ours, Audio2Head, EAT, Hallo, and SadTalker. Columns represent 25s intervals.
  • Figure 3: Results of our model on the FakeAVCeleb, CelebV-HQ, and HDTF datasets.
  • Figure 4: Ground Truth vs. Generated Audio Spectrograms for the (a) VoxCeleb, (b) CelebV-HQ, (c) FakeAVCeleb, and (d) HDTF datasets.