Text-driven Talking Face Synthesis by Reprogramming Audio-driven Models

Jeongsoo Choi; Minsu Kim; Se Jin Park; Yong Man Ro

TL;DR

The paper addresses editable talking-face synthesis by converting text into audio latent representations that feed into pre-trained audio-driven face synthesis models. It introduces the Text-to-Audio Embedding Module (TAEM), which combines a phoneme-aware encoder, a duration predictor, and a speech refine module, and which takes a visual speaker embedding as an additional input to capture speaker identity. Because TAEM maps text into the audio latent space, text and audio inputs yield comparable lip-synced videos. The approach outperforms cascaded text-to-speech pipelines and generalizes to multiple audio-driven models. It enables flexible, in-the-wild text-driven video generation without training a separate text-driven model from scratch.

Abstract

In this paper, we present a method for reprogramming pre-trained audio-driven talking face synthesis models to operate in a text-driven manner. Consequently, we can easily generate face videos that articulate the provided textual sentences, eliminating the necessity of recording speech for each inference, as required in the audio-driven model. To this end, we propose to embed the input text into the learned audio latent space of the pre-trained audio-driven model, while preserving the face synthesis capability of the original pre-trained model. Specifically, we devise a Text-to-Audio Embedding Module (TAEM) which maps a given text input into the audio latent space by modeling pronunciation and duration characteristics. Furthermore, to consider the speaker characteristics in audio while using text inputs, TAEM is designed to accept a visual speaker embedding. The visual speaker embedding is derived from a single target face image and enables improved mapping of input text to the learned audio latent space by incorporating the speaker characteristics inherent in the audio. The main advantages of the proposed framework are that 1) it can be applied to diverse audio-driven talking face synthesis models and 2) we can generate talking face videos with either text inputs or audio inputs with high flexibility.
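
The duration modeling mentioned above can be illustrated with a length regulator, which expands a phoneme-level sequence to a frame-level one so that it matches the time axis of an audio latent sequence. The following is only a minimal sketch of that general idea, not the authors' implementation: the function name `length_regulate`, the plain-list representation of embeddings, and the example values are all assumptions made for illustration.

```python
from typing import List, Sequence


def length_regulate(
    phoneme_embeddings: Sequence[Sequence[float]],
    durations: Sequence[int],
) -> List[List[float]]:
    """Repeat each phoneme embedding for its (predicted) number of frames.

    Hypothetical helper: in a real system the durations would come from a
    learned duration predictor; here they are given directly.
    """
    if len(phoneme_embeddings) != len(durations):
        raise ValueError("need exactly one duration per phoneme embedding")

    frames: List[List[float]] = []
    for embedding, duration in zip(phoneme_embeddings, durations):
        if duration < 0:
            raise ValueError("durations must be non-negative")
        # Copy the embedding once per frame so frames can be edited independently.
        frames.extend(list(embedding) for _ in range(duration))
    return frames


# Illustrative values only: three phonemes with 2-dimensional embeddings.
example_embeddings = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
example_durations = [2, 0, 3]  # the second phoneme is skipped entirely
example_frames = length_regulate(example_embeddings, example_durations)
```

With these inputs the result has five frames (the sum of the durations), which is the property that lets a text-derived sequence stand in for an audio-derived one of the same length.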

Paper Structure (18 sections, 1 equation, 3 figures, 2 tables)

Introduction
Proposed Method
Baseline Audio-driven Model
Text-to-Audio Embedding Module (TAEM)
Face Embedding and Phoneme Encoder
Duration Predictor and Length Regulator
Speech Refine Module
Objective Functions
Experimental Setup
Dataset
Baseline Methods
Evaluation Metrics
Implementation Details
Experimental Results
Generation Quality Comparison
...and 3 more sections

Figures (3)

Figure 1: Overview of the proposed text-driven talking face synthesis framework which reprograms the audio-driven models. (a) In the training stage, TAEM learns to embed the text representation into the audio latent space. (b) In the inference stage, we can generate a talking face video by inserting either a text or audio containing desired speech content.
Figure 2: The detailed architecture of the proposed TAEM.
Figure 3: Qualitative results comparison on LRS2 dataset.
