MultiTalk: Enhancing 3D Talking Head Generation Across Languages with Multilingual Video Dataset

Kim Sung-Bin, Lee Chae-Yeon, Gihun Son, Oh Hyun-Bin, Janghoon Ju, Suekyeong Nam, Tae-Hyun Oh

TL;DR

The paper tackles multilingual 3D talking head generation by introducing the MultiTalk dataset, a large-scale, multilingual 2D video corpus with pseudo-3D ground truth across 20 languages. It presents a two-stage baseline that learns a discrete facial-motion prior via VQ-VAE and then synthesizes speech-driven motions conditioned on multilingual speech representations and language-specific style embeddings. A new evaluation metric, Audio-Visual Lip Readability (AVLR), is proposed to quantify lip-sync accuracy in multilingual settings and aligns well with human judgments. While the dataset relies on pseudo-annotations, the approach demonstrates improved multilingual lip synchronization and realism compared to English-centric baselines, highlighting the practical potential for cross-language virtual avatars and related applications.

Abstract

Recent studies in speech-driven 3D talking head generation have achieved convincing results in verbal articulation. However, lip-sync accuracy degrades when these models are applied to input speech in other languages, possibly due to the lack of datasets covering a broad spectrum of facial movements across languages. In this work, we introduce a novel task: generating 3D talking heads from speech in diverse languages. We collect a new multilingual 2D video dataset comprising over 420 hours of talking videos in 20 languages. With our proposed dataset, we present a multilingually enhanced model that incorporates language-specific style embeddings, enabling it to capture the unique mouth movements associated with each language. Additionally, we present a metric for assessing lip-sync accuracy in multilingual settings. We demonstrate that training a 3D talking head model with our proposed dataset significantly enhances its multilingual performance. Code and datasets are available at https://multi-talk.github.io/.

Paper Structure (21 sections, 3 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: Samples of our MultiTalk dataset. Each 2D video is annotated with the language type and the pseudo transcript, and a subset of the videos further provides pseudo 3D mesh vertices.
  • Figure 2: Overall pipeline of MultiTalk. In stage 1, a codebook of discrete motions is learned from 3D facial meshes speaking in diverse languages. In stage 2, the model learns to autoregressively generate a sequence of motion representations from input speech. These representations are quantized by the codebook to synthesize a speech-driven 3D talking head.
  • Figure 3: Qualitative comparisons. Compared to existing methods, MultiTalk (Ours) demonstrates detailed facial expressions with accurately synchronized lip movements to the input speech.
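
The quantization step named in the Figure 2 caption can be illustrated with a minimal NumPy sketch. It shows only a nearest-neighbor codebook lookup, the standard VQ-VAE quantization rule. It is not the authors' implementation: the function name `quantize`, the array shapes, and the toy values are assumptions made for illustration, and codebook training and the autoregressive stage-2 model are omitted.

```python
import numpy as np

def quantize(features, codebook):
    """Map each feature vector to its nearest codebook entry (squared L2 distance).

    features: (T, D) array of continuous motion features (hypothetical shape).
    codebook: (K, D) array of discrete motion codes (hypothetical shape).
    Returns (indices, quantized) with shapes (T,) and (T, D).
    """
    # Pairwise squared distances between every feature and every code: (T, K).
    diffs = features[:, None, :] - codebook[None, :, :]
    dists = np.sum(diffs ** 2, axis=-1)
    # Index of the closest code for each time step.
    indices = np.argmin(dists, axis=1)
    return indices, codebook[indices]

# Toy example: three 2-D codes and three feature vectors near them.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]])
features = np.array([[0.1, -0.2], [0.9, 1.2], [-0.8, 1.7]])
indices, quantized = quantize(features, codebook)
print(indices)
```

In this toy setup each feature vector snaps to the code it was placed near, so the sequence of continuous features becomes a sequence of discrete code indices.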