A Bridge from Audio to Video: Phoneme-Viseme Alignment Allows Every Face to Speak Multiple Languages

Zibo Su; Kun Wei; Jiahua Li; Xu Yang; Cheng Deng

TL;DR

MuEx addresses multilingual speech-driven talking face synthesis (TFS) with language-agnostic phoneme-viseme representations and a pseudo-phoneme guided mixture-of-experts router. Phoneme-Viseme Alignment (PV-Align) builds cross-language correspondences through a discrete prototype space and a mutual-information objective, while Pseudo-Phoneme Guided Expert Routing selects experts without explicit language supervision. The approach is evaluated on the Multilingual Talking Face Benchmark (MTFB; 12 languages, 95.04 hours) using LSE-D and TMDC alongside the standard FVD and Sync-C, and is reported to reach state-of-the-art performance with strong zero-shot generalization to unseen languages. By bridging audio and video through universal articulatory units, the work targets realistic lip-sync and expressive facial dynamics across diverse languages. MTFB is introduced alongside the model to enable rigorous cross-language evaluation of TFS systems.
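To make the idea of a discrete prototype space feeding a pseudo-phoneme guided router concrete, here is a minimal NumPy sketch. It is an illustration under stated assumptions, not the MuEx implementation: the nearest-prototype assignment, the lookup-table router, the top-2 selection, the linear experts, and all names (`assign_pseudo_phonemes`, `route_experts`, `mixture_output`, `router_table`) are hypothetical choices that do not come from the paper.

```python
import numpy as np

def assign_pseudo_phonemes(features, prototypes):
    """Map each frame feature to the index of its nearest prototype.

    Assumption: nearest-neighbor lookup in Euclidean space stands in for
    whatever quantization the paper actually uses.
    """
    # Squared distances between every frame (T, D) and prototype (K, D).
    dists = ((features[:, None, :] - prototypes[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)  # shape (T,)

def route_experts(pseudo_ids, router_table, top_k=2):
    """Pick top-k experts per frame from logits indexed by pseudo-phoneme id."""
    logits = router_table[pseudo_ids]                      # (T, E)
    top = np.argsort(-logits, axis=1)[:, :top_k]           # (T, top_k)
    top_logits = np.take_along_axis(logits, top, axis=1)
    # Numerically stable softmax over the selected experts only.
    top_logits = top_logits - top_logits.max(axis=1, keepdims=True)
    weights = np.exp(top_logits)
    weights /= weights.sum(axis=1, keepdims=True)
    return top, weights

def mixture_output(features, expert_mats, top, weights):
    """Weighted sum of the selected experts; each expert is a (D, D_out) matrix."""
    out = np.zeros((features.shape[0], expert_mats.shape[2]))
    for k in range(top.shape[1]):
        selected = expert_mats[top[:, k]]                  # (T, D, D_out)
        out += weights[:, k:k + 1] * np.einsum("td,tdo->to", features, selected)
    return out

# Toy sizes, chosen arbitrarily for the example.
rng = np.random.default_rng(0)
T, D, K, E, D_out = 6, 4, 8, 3, 5
features = rng.normal(size=(T, D))        # stand-in for audio frame features
prototypes = rng.normal(size=(K, D))      # stand-in for learned prototypes
router_table = rng.normal(size=(K, E))    # one row of expert logits per prototype
expert_mats = rng.normal(size=(E, D, D_out))

ids = assign_pseudo_phonemes(features, prototypes)
top, weights = route_experts(ids, router_table)
out = mixture_output(features, expert_mats, top, weights)
```

Because the router only sees the pseudo-phoneme index, two frames from different languages that fall on the same prototype are sent to the same experts, which is one way routing can proceed without a language label.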

Abstract

Speech-driven talking face synthesis (TFS) focuses on generating lifelike facial animations from audio input. Current TFS models perform well in English but poorly in other languages, producing incorrect mouth shapes and rigid facial expressions. This poor performance stems from English-dominated training datasets and a lack of cross-language generalization ability. We therefore propose Multilingual Experts (MuEx), a novel framework featuring a Phoneme-Guided Mixture-of-Experts (PG-MoE) architecture that employs phonemes and visemes as universal intermediaries to bridge the audio and video modalities, achieving lifelike multilingual TFS. To alleviate the influence of linguistic differences and dataset bias, we extract audio and video features as phonemes and visemes, respectively, the basic units of speech sounds and mouth movements. To address audiovisual synchronization issues, we introduce the Phoneme-Viseme Alignment Mechanism (PV-Align), which establishes robust cross-modal correspondences between phonemes and visemes. In addition, we build a Multilingual Talking Face Benchmark (MTFB) comprising 12 diverse languages and 95.04 hours of high-quality video for training and evaluating multilingual TFS. Extensive experiments demonstrate that MuEx achieves superior performance across all languages in MTFB and generalizes zero-shot to unseen languages without additional training.
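As a rough illustration of how cross-modal correspondences between phoneme and viseme embeddings can be encouraged with a mutual-information objective, the sketch below uses an InfoNCE-style contrastive loss. InfoNCE is swapped in here as one common mutual-information lower bound; the source does not state which estimator PV-Align uses, and the function name `info_nce`, the temperature, and the toy data are assumptions.

```python
import numpy as np

def info_nce(phoneme_emb, viseme_emb, temperature=0.1):
    """InfoNCE loss for N time-aligned (phoneme, viseme) embedding pairs.

    Row i of each array is assumed to describe the same time step, so the
    positives sit on the diagonal of the similarity matrix.
    """
    # L2-normalize so the dot product is cosine similarity.
    p = phoneme_emb / np.linalg.norm(phoneme_emb, axis=1, keepdims=True)
    v = viseme_emb / np.linalg.norm(viseme_emb, axis=1, keepdims=True)
    logits = p @ v.T / temperature                          # (N, N)
    # Numerically stable row-wise log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Negative mean log-probability of the matching viseme for each phoneme.
    return -np.mean(np.diag(log_probs))

rng = np.random.default_rng(0)
phonemes = rng.normal(size=(16, 8))
aligned_visemes = phonemes + 0.01 * rng.normal(size=(16, 8))  # near-perfect pairs
shuffled_visemes = np.roll(aligned_visemes, shift=1, axis=0)  # pairs broken

loss_aligned = info_nce(phonemes, aligned_visemes)
loss_shuffled = info_nce(phonemes, shuffled_visemes)
```

Minimizing this loss maximizes the quantity $\log N - \mathcal{L}_{\text{InfoNCE}}$, a standard lower bound on the mutual information between the two embedding streams, so correctly paired phonemes and visemes yield a lower loss than mismatched ones.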
