LTA-L2S: Lexical Tone-Aware Lip-to-Speech Synthesis for Mandarin with Cross-Lingual Transfer Learning

Kang Yang, Yifan Liang, Fangkun Liu, Zhenping Xie, Chengshi Zheng

TL;DR

This paper tackles Mandarin lip-to-speech synthesis, where two problems dominate: viseme-to-phoneme ambiguity and the dependence of intelligibility on lexical tones. It introduces LTA-L2S, which transfers knowledge from an English audio-visual SSL model (AV-HuBERT) to Mandarin and models tones with a flow-matching F0 predictor guided by ASR-fine-tuned SSL speech units. Training proceeds in two stages, with a flow-matching postnet refining the coarse spectrogram produced by the first stage. On the CN-CVS Mandarin dataset, the approach achieves state-of-the-art or competitive performance, with gains in intelligibility, tonal accuracy, and speaker similarity confirmed by both objective and subjective evaluations. The work shows that cross-lingual knowledge transfer and flow matching are practical for tonal languages and lays groundwork for extension to other Chinese dialects and accents.

Abstract

Lip-to-speech (L2S) synthesis for Mandarin is a significant challenge, hindered by complex viseme-to-phoneme mappings and the critical role of lexical tones in intelligibility. To address this issue, we propose Lexical Tone-Aware Lip-to-Speech (LTA-L2S). To tackle viseme-to-phoneme complexity, our model adapts an English pre-trained audio-visual self-supervised learning (SSL) model via a cross-lingual transfer learning strategy. This strategy not only transfers universal knowledge learned from extensive English data to the Mandarin domain but also circumvents the prohibitive cost of training such a model from scratch. To specifically model lexical tones and enhance intelligibility, we further employ a flow-matching model to generate the F0 contour. This generation process is guided by ASR-fine-tuned SSL speech units, which contain crucial suprasegmental information. The overall speech quality is then elevated through a two-stage training paradigm, where a flow-matching postnet refines the coarse spectrogram from the first stage. Extensive experiments demonstrate that LTA-L2S significantly outperforms existing methods in both speech intelligibility and tonal accuracy.
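
The flow-matching F0 generation mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not state the exact objective, so the code below assumes the common straight-path conditional flow-matching formulation, in which a noise sample x0 is moved linearly toward a data sample x1 and the regression target is the constant velocity x1 - x0. The names `cfm_training_pair`, `euler_sample`, `velocity_fn`, and `cond` are hypothetical and are not taken from the paper; in a real system `velocity_fn` would be a neural network conditioned on the speech units and visual features.

```python
import numpy as np


def cfm_training_pair(f0_target, rng):
    """Build one training example for straight-path flow matching.

    f0_target: 1-D array, an F0 contour (assumed already normalized).
    Returns (x_t, t, target_velocity); a network would be trained to
    predict target_velocity from (x_t, t, conditioning).
    """
    x1 = np.asarray(f0_target, dtype=np.float64)  # data sample
    x0 = rng.standard_normal(x1.shape)            # Gaussian noise sample
    t = rng.uniform(0.0, 1.0)                     # random time in [0, 1)
    x_t = (1.0 - t) * x0 + t * x1                 # point on the straight path
    target_velocity = x1 - x0                     # constant along that path
    return x_t, t, target_velocity


def euler_sample(velocity_fn, cond, num_frames, rng, num_steps=32):
    """Generate a contour by integrating dx/dt = velocity_fn(x, t, cond).

    Uses fixed-step Euler integration from t = 0 (noise) to t = 1 (data).
    velocity_fn is a stand-in (assumption) for a trained network.
    """
    x = rng.standard_normal(num_frames)
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t, cond)
    return x
```

With a perfectly learned velocity field, the Euler steps stay on the straight path and end at the target contour, which is why this family of models can sample in relatively few steps. The step count of 32 is an arbitrary illustrative default, not a value reported by the authors.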

Paper Structure

This paper contains 17 sections, 5 equations, 1 figure, and 4 tables.

Figures (1)

  • Figure 1: The overall architecture of our proposed LTA-L2S model. The left panel illustrates the main synthesis network, which processes visual features to generate a coarse mel-spectrogram. This coarse spectrogram is subsequently refined by the flow-matching postnet shown in the middle panel. The internal structure of the postnet's DiT block is detailed on the right.