Shared Latent Representation for Joint Text-to-Audio-Visual Synthesis
Dogucan Yaman, Seymanur Akti, Fevziye Irem Eyiokur, Alexander Waibel
TL;DR
The paper addresses the generation of synchronized talking-face video and speech from text, targeting the audio-visual misalignment that arises in cascaded pipelines. It introduces a joint latent-space framework in which a Text-to-Vec (TTV) module converts text into Wav2Vec2 latent features, and those features condition both speech synthesis (via HierSpeech++) and talking-face generation. A two-stage training scheme bridges the gap between features extracted from real audio and features predicted by the TTS model. The stated contributions are the first joint text-to-audio-visual synthesis setup for face dubbing, and empirical evidence that the shared latent space improves lip synchronization and visual realism while removing the need for ground-truth audio at inference. The approach performs competitively against state-of-the-art methods and offers a practical route to end-to-end text-driven audiovisual generation. The paper notes limitations in generalizing to unseen languages and in modeling subtle facial expressions beyond lip motion.
Abstract
We propose a text-to-talking-face synthesis framework that leverages latent speech representations from HierSpeech++. A Text-to-Vec module generates Wav2Vec2 embeddings from text, and these embeddings jointly condition speech and face generation. To handle the distribution shift between clean and TTS-predicted features, we adopt a two-stage training scheme: pretraining on Wav2Vec2 embeddings, then finetuning on TTS outputs. This enables tight audio-visual alignment, preserves speaker identity, and produces natural, expressive speech and synchronized facial motion without ground-truth audio at inference. Experiments show that conditioning on TTS-predicted latent features outperforms cascaded pipelines, improving both lip-sync and visual realism.
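The data flow described in the abstract can be illustrated with a minimal, hedged sketch. It is not the authors' implementation: every name below (`Sample`, `toy_text_to_vec`, `select_conditioning`, `joint_forward`, and the two toy decoders) is hypothetical, and single-element float lists stand in for Wav2Vec2 feature frames. The sketch shows only two ideas: the conditioning source switches between training stages, and both branches consume the same latent sequence.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Latent = List[float]  # toy stand-in for one Wav2Vec2-style feature frame


@dataclass
class Sample:
    text: str
    real_latents: List[Latent]  # assumed: features extracted from real audio


def toy_text_to_vec(text: str) -> List[Latent]:
    """Hypothetical TTV stand-in: emits one latent frame per character."""
    return [[float(ord(ch))] for ch in text]


def select_conditioning(
    sample: Sample,
    stage: str,
    text_to_vec: Callable[[str], List[Latent]],
) -> List[Latent]:
    """Pick the conditioning features for the given training stage."""
    if stage == "pretrain":
        # Stage 1: condition on features from real audio.
        return sample.real_latents
    if stage == "finetune":
        # Stage 2: condition on features predicted from text, which is
        # also what is available at inference time.
        return text_to_vec(sample.text)
    raise ValueError(f"unknown stage: {stage!r}")


def joint_forward(
    latents: List[Latent],
    speech_decoder: Callable[[List[Latent]], List[float]],
    face_decoder: Callable[[List[Latent]], List[float]],
) -> Tuple[List[float], List[float]]:
    """Feed the same latent sequence to both branches."""
    return speech_decoder(latents), face_decoder(latents)


def toy_speech_decoder(latents: List[Latent]) -> List[float]:
    # Toy: one output value per latent frame.
    return [frame[0] for frame in latents]


def toy_face_decoder(latents: List[Latent]) -> List[float]:
    # Toy: one scalar per latent frame (think "mouth openness").
    return [frame[0] % 2.0 for frame in latents]


sample = Sample(text="hi", real_latents=[[0.1], [0.2], [0.3]])
pretrain_cond = select_conditioning(sample, "pretrain", toy_text_to_vec)
finetune_cond = select_conditioning(sample, "finetune", toy_text_to_vec)
speech, face = joint_forward(finetune_cond, toy_speech_decoder, toy_face_decoder)
```

Because both toy decoders read the same sequence, their outputs share one time axis by construction, which is the intuition behind conditioning speech and face generation on a shared latent rather than chaining separate models.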
