Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models
Jeongsoo Choi, Minsu Kim, Se Jin Park, Yong Man Ro
TL;DR
The paper addresses editable talking-face synthesis by converting text into audio latent representations that feed into pre-trained audio-driven face synthesis models. It introduces the Text-to-Audio Embedding Module (TAEM), which combines a phoneme-aware encoder, a duration predictor, and a speech refine module, and is conditioned on a visual speaker embedding to capture identity. TAEM maps text into the audio latent space so that text and audio inputs yield comparable lip-synced videos; the approach outperforms cascaded text-to-speech pipelines and generalizes to multiple audio-driven models. It enables flexible, in-the-wild text-driven video generation without training a separate text-driven model from scratch.
Abstract
In this paper, we present a method for reprogramming pre-trained audio-driven talking face synthesis models to operate in a text-driven manner. Consequently, we can easily generate face videos that articulate the provided textual sentences, eliminating the need to record speech for each inference, as audio-driven models require. To this end, we propose to embed the input text into the learned audio latent space of the pre-trained audio-driven model, while preserving the face synthesis capability of the original pre-trained model. Specifically, we devise a Text-to-Audio Embedding Module (TAEM) which maps a given text input into the audio latent space by modeling pronunciation and duration characteristics. Furthermore, to account for the speaker characteristics in audio when using text inputs, TAEM is designed to accept a visual speaker embedding. The visual speaker embedding is derived from a single target face image and improves the mapping of input text to the learned audio latent space by incorporating the speaker characteristics inherent in the audio. The main advantages of the proposed framework are that 1) it can be applied to diverse audio-driven talking face synthesis models and 2) talking face videos can be generated flexibly from either text inputs or audio inputs.
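To make the idea of mapping text into a frame-level audio latent space more concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes that duration modeling is realized by repeating each phoneme embedding for a predicted number of frames (a length-regulation step common in non-autoregressive speech synthesis) and that the speaker embedding is simply added to every frame. The function names `length_regulate`, `add_speaker`, and `text_to_audio_latents`, as well as the additive conditioning, are hypothetical; the abstract does not specify these details.

```python
from typing import List, Sequence

Vector = List[float]


def length_regulate(phoneme_embeddings: Sequence[Vector],
                    durations: Sequence[int]) -> List[Vector]:
    """Repeat each phoneme embedding durations[i] times (assumed mechanism).

    This turns a phoneme-level sequence into a frame-level sequence whose
    length matches the audio latent sequence expected by the face model.
    """
    if len(phoneme_embeddings) != len(durations):
        raise ValueError("need exactly one duration per phoneme")
    frames: List[Vector] = []
    for embedding, duration in zip(phoneme_embeddings, durations):
        if duration < 0:
            raise ValueError("durations must be non-negative")
        # Copy the embedding so frames do not share the same list object.
        frames.extend(list(embedding) for _ in range(duration))
    return frames


def add_speaker(frames: Sequence[Vector],
                speaker_embedding: Vector) -> List[Vector]:
    """Add a speaker embedding to every frame (hypothetical conditioning)."""
    conditioned: List[Vector] = []
    for frame in frames:
        if len(frame) != len(speaker_embedding):
            raise ValueError("frame and speaker embedding sizes differ")
        conditioned.append([f + s for f, s in zip(frame, speaker_embedding)])
    return conditioned


def text_to_audio_latents(phoneme_embeddings: Sequence[Vector],
                          durations: Sequence[int],
                          speaker_embedding: Vector) -> List[Vector]:
    """Toy text-to-audio-latent mapping: expand by duration, then condition."""
    frames = length_regulate(phoneme_embeddings, durations)
    return add_speaker(frames, speaker_embedding)


if __name__ == "__main__":
    # Two toy phoneme embeddings lasting 2 frames and 1 frame respectively.
    phonemes = [[1.0, 0.0], [0.0, 1.0]]
    latents = text_to_audio_latents(phonemes, [2, 1], [0.5, 0.5])
    print(len(latents), latents)
```

In a real system the embeddings, durations, and speaker vector would come from learned networks, and the resulting frame sequence would be passed to the frozen audio-driven face synthesis model in place of its audio encoder output.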
