Shared Latent Representation for Joint Text-to-Audio-Visual Synthesis

Dogucan Yaman, Seymanur Akti, Fevziye Irem Eyiokur, Alexander Waibel

TL;DR

The paper addresses the generation of synchronized talking-face video and speech from text, targeting the audio-visual misalignment that arises in cascaded pipelines. It introduces a joint latent-space framework in which a Text-to-Vec (TTV) module converts text into Wav2Vec2 latent features, and those features condition both speech synthesis (via HierSpeech++) and talking-face generation. A two-stage training scheme bridges the gap between features extracted from real audio and TTS-predicted features. Key contributions include the first joint text-to-audio-visual synthesis setup for face dubbing, and empirical evidence that the joint latent space improves lip synchronization and visual realism while eliminating the need for ground-truth audio during inference. The approach performs competitively against state-of-the-art methods and offers a practical route to end-to-end text-driven audiovisual generation, though the authors note limitations in generalizing to unseen languages and in modeling subtle facial expressions beyond lip motion.

Abstract

We propose a text-to-talking-face synthesis framework leveraging latent speech representations from HierSpeech++. A Text-to-Vec module generates Wav2Vec2 embeddings from text, which jointly condition speech and face generation. To handle distribution shifts between clean and TTS-predicted features, we adopt a two-stage training scheme: pretraining on Wav2Vec2 embeddings and finetuning on TTS outputs. This enables tight audio-visual alignment, preserves speaker identity, and produces natural, expressive speech and synchronized facial motion without ground-truth audio at inference. Experiments show that conditioning on TTS-predicted latent features outperforms cascaded pipelines, improving both lip-sync and visual realism.
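
The data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: every function below (`text_to_vec`, `synthesize_speech`, `generate_face_frames`, `face_conditioning`) is a hypothetical stand-in, and the constants are illustrative values chosen only to make the example run. The point it tries to show is structural: because the speech branch and the face branch read the same latent sequence, their output lengths stay locked together, and the face branch can be conditioned on features from real audio in one stage and on text-predicted features in the next.

```python
from dataclasses import dataclass
from typing import List

# Toy constants, all assumptions for illustration (not values from the paper).
LATENT_DIM = 4               # real Wav2Vec2 features are far wider
LATENTS_PER_CHAR = 2         # hypothetical fixed duration per character
SAMPLES_PER_LATENT = 320     # hypothetical audio samples per latent step
LATENTS_PER_VIDEO_FRAME = 2  # hypothetical latent steps per video frame

Latents = List[List[float]]


def text_to_vec(text: str) -> Latents:
    """Stand-in for the TTV module: text -> sequence of latent vectors."""
    latents: Latents = []
    for ch in text:
        value = (ord(ch) % 32) / 32.0
        for _ in range(LATENTS_PER_CHAR):
            latents.append([value] * LATENT_DIM)
    return latents


def synthesize_speech(latents: Latents) -> List[float]:
    """Stand-in for the speech decoder: a fixed number of samples per latent."""
    samples: List[float] = []
    for vec in latents:
        samples.extend([vec[0]] * SAMPLES_PER_LATENT)
    return samples


def generate_face_frames(latents: Latents) -> List[float]:
    """Stand-in for the face generator: one scalar 'mouth state' per frame."""
    frames: List[float] = []
    for start in range(0, len(latents), LATENTS_PER_VIDEO_FRAME):
        window = latents[start:start + LATENTS_PER_VIDEO_FRAME]
        frames.append(sum(vec[0] for vec in window) / len(window))
    return frames


@dataclass
class AVOutput:
    audio: List[float]
    frames: List[float]


def joint_synthesis(text: str) -> AVOutput:
    """Both branches consume the same latents, so no ground-truth audio is needed."""
    latents = text_to_vec(text)
    return AVOutput(synthesize_speech(latents), generate_face_frames(latents))


def face_conditioning(stage: int, real_audio_latents: Latents, text: str) -> Latents:
    """Pick the face-branch conditioning for each training stage.

    Stage 1: features extracted from real audio (pretraining).
    Stage 2: features predicted from text (finetuning on TTS outputs).
    """
    if stage == 1:
        return real_audio_latents
    if stage == 2:
        return text_to_vec(text)
    raise ValueError("stage must be 1 or 2")


if __name__ == "__main__":
    out = joint_synthesis("hi")
    print(len(out.audio), len(out.frames))
```

In this sketch the audio length and the frame count are both derived from the single latent sequence, which is the property a cascaded pipeline (text to audio, then audio to face) does not guarantee by construction.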

Paper Structure

This paper contains 8 sections, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Joint text-to-audio-visual synthesis framework. Text is converted to latent Wav2Vec2 features via TTV, which condition both speech synthesis and talking-face generation for synchronized output without ground-truth audio.
  • Figure 2: Qualitative comparison of our model with other approaches. Because our model is trained with predicted Wav2Vec2 features and is designed to align lips with TTS-generated audio in a joint space, the expected lip shapes do not need to match those of the ground truth (GT).