FT2TF: First-Person Statement Text-To-Talking Face Generation

Xingjian Diao, Ming Cheng, Wayner Barrios, SouYoung Jin

TL;DR

This work proposes FT2TF (First-Person Statement Text-To-Talking Face Generation), a one-stage, end-to-end pipeline for talking face generation driven by first-person statement text. The authors report that it outperforms existing related methods and reaches state-of-the-art results.

Abstract

Talking face generation has gained immense popularity in the computer vision community, with applications including AR, VR, teleconferencing, digital assistants, and avatars. Traditional methods are mainly audio-driven and must therefore bear the resource-intensive cost of storing and processing audio. To address this challenge, we propose FT2TF (First-Person Statement Text-To-Talking Face Generation), a novel one-stage, end-to-end pipeline for talking face generation driven by first-person statement text. Unlike previous work, our model uses only visual and textual information during inference, without any other sources (e.g., audio, landmarks, or pose). Extensive experiments are conducted on the LRS2 and LRS3 datasets, and results on multi-dimensional evaluation metrics are reported. Both quantitative and qualitative results show that FT2TF outperforms existing related methods and reaches state-of-the-art performance. This result highlights our model's ability to bridge first-person statements and dynamic face generation, providing insightful guidance for future work.

Paper Structure

This paper contains 30 sections, 4 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: High-quality text-driven talking face generation. FT2TF aims to generate realistic talking faces using two inputs: (i) reference talking face frames and (ii) first-person statement text.
  • Figure 2: Overview of the FT2TF pipeline. The pipeline employs two specialized text encoders, the Global Emotion Text Encoder and the Linguistic Text Encoder, to extract emotional and linguistic text features, respectively. A Visual Encoder extracts visual features. A Multi-Scale Cross-Attention Module then fuses the visual and textual features, and the resulting visual-textual representations are fed to a Visual Decoder that synthesizes the talking face frames.
  • Figure 3: Visualization of Global and Local Cross-Attention. Red areas indicate high attention weights; blue areas indicate low attention weights.
  • Figure 4: Efficiency comparison. SSIM scores are compared against the total number of trainable parameters for various models, demonstrating our model's efficiency.
  • Figure 5: Qualitative comparison with state-of-the-art methods on LRS2 and LRS3. The three models on the left specialize in lip generation, whereas the others are designed to generate entire faces. TTFS (Jang et al., 2024) and our model are text-driven, whereas the remaining methods are audio-driven. Our model consistently generates the most detailed and accurate talking faces across diverse roles, genders, and ages under different lighting and head-pose conditions.
  • ...and 3 more figures
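
The visual-textual fusion described in the Figure 2 caption relies on cross-attention, in which visual features attend to text features. The sketch below is only an illustration of generic scaled dot-product cross-attention on toy vectors, not the authors' Multi-Scale Cross-Attention Module: it omits learned projections, multiple heads, and the global/local scales, and the names `cross_attention`, `visual_tokens`, and `text_tokens` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product cross-attention without learned projections.

    queries: list of query vectors (here: toy visual features)
    keys, values: lists of key/value vectors (here: toy text features)
    Returns (fused, weights): one fused vector and one attention row per query.
    """
    d = len(keys[0])
    fused, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(d).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        weights.append(w)
        # Weighted sum of the value vectors.
        fused.append([sum(wj * v[c] for wj, v in zip(w, values))
                      for c in range(len(values[0]))])
    return fused, weights


# Toy inputs (assumed for illustration): 3 visual tokens, 2 text tokens, dimension 2.
visual_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
text_tokens = [[2.0, 0.0], [0.0, 2.0]]

fused, weights = cross_attention(visual_tokens, text_tokens, text_tokens)
for row in weights:
    print([round(w, 3) for w in row])
```

Each row of `weights` is a probability distribution over the text tokens for one visual token. Attention maps such as those in Figure 3 are typically rendered from weights of this kind, although the exact visualization procedure used in the paper is not stated here.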