PASE: Phoneme-Aware Speech Encoder to Improve Lip Sync Accuracy for Talking Head Synthesis

Yihuan Huang, Jiajun Liu, Yanzhen Ren, Jun Xue, Wuyang Liu, Zongkun Sun

TL;DR

This work tackles phoneme–viseme alignment ambiguity in talking head synthesis by introducing PASE, a phoneme-aware speech encoder that can be plugged into existing NeRF- or 3DGS-based renderers. PASE employs an STFT-based audio pathway with a GRU, a multi-channel CNN for lip visuals, and a phoneme-level alignment module that fuses audio and visual features via cross-attention and contrastive learning, complemented by a prediction and reconstruction task for robustness. Empirical results show state-of-the-art lip-sync improvements, with 13.7% and 14.2% gains in lip-sync metrics over strong baselines on NeRF and 3DGS pipelines, respectively, while maintaining efficient inference. The encoder is designed to be easily integrated into diverse pipelines, enabling significant performance gains without architectural changes to the rendering models.

Abstract

Recent talking head synthesis works typically adopt speech features extracted from large-scale pre-trained acoustic models. However, the intrinsic many-to-many relationship between speech and lip motion causes phoneme-viseme alignment ambiguity, leading to inaccurate and unstable lip motion. To further improve lip sync accuracy, we propose PASE (Phoneme-Aware Speech Encoder), a novel speech representation model that bridges the gap between phonemes and visemes. PASE explicitly introduces phoneme embeddings as alignment anchors and employs a contrastive alignment module to enhance the discriminability between corresponding audio-visual pairs. In addition, a prediction and reconstruction task is designed to improve robustness under noise and partial modality absence. Experimental results show that PASE significantly improves lip sync accuracy and achieves state-of-the-art performance across both NeRF- and 3DGS-based rendering frameworks, outperforming conventional methods based on acoustic features by 13.7% and 14.2%, respectively. Importantly, PASE can be seamlessly integrated into diverse talking head pipelines to improve lip sync accuracy without architectural modifications.
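To make the idea of contrastive audio-visual alignment concrete, the following is a minimal sketch of a symmetric InfoNCE loss over a batch of paired audio and lip embeddings. It is an illustration only, not the paper's loss: the function names, the temperature of 0.07, the embedding sizes, and the synthetic data are all assumptions, and the phoneme-anchor and cross-attention components described above are omitted.

```python
import math
import random

def normalize(vec):
    """Scale a vector to unit length so dot products are cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def info_nce(queries, keys, temperature=0.07):
    """Mean cross-entropy of selecting keys[i] for queries[i] among all keys."""
    q = [normalize(v) for v in queries]
    k = [normalize(v) for v in keys]
    total = 0.0
    for i, qi in enumerate(q):
        # Similarity of query i to every key in the batch, scaled by temperature.
        logits = [sum(a * b for a, b in zip(qi, kj)) / temperature for kj in k]
        # Log-sum-exp with the max subtracted for numerical stability.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_z - logits[i]  # negative log-probability of the true pair
    return total / len(q)

def symmetric_info_nce(audio, visual, temperature=0.07):
    """Average of the audio-to-visual and visual-to-audio InfoNCE terms."""
    return 0.5 * (info_nce(audio, visual, temperature)
                  + info_nce(visual, audio, temperature))

# Hypothetical data: 8 lip embeddings of dimension 16, plus audio embeddings
# that are either nearly identical to their partner or paired with the wrong one.
rng = random.Random(0)
visual = [[rng.gauss(0, 1) for _ in range(16)] for _ in range(8)]
aligned = [[x + 0.01 * rng.gauss(0, 1) for x in row] for row in visual]
shuffled = aligned[1:] + aligned[:1]  # every audio row now has the wrong partner

loss_aligned = symmetric_info_nce(aligned, visual)
loss_shuffled = symmetric_info_nce(shuffled, visual)
```

In this toy setup the correctly paired batch yields a lower loss than the mispaired one, which is the behavior a contrastive objective rewards: matched audio-visual pairs are pulled together and mismatched pairs are pushed apart.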

Paper Structure

This paper contains 21 sections, 7 equations, 7 figures, 7 tables.

Figures (7)

  • Figure 1: The comparison of the talking head synthesis pipeline when using acoustic features and PASE. The core of PASE is to solve phoneme-viseme alignment ambiguity, which refers to the uncertainty and imprecision in matching phonemes (speech) with visemes (lip). PASE is an independent encoder that can be seamlessly integrated into various rendering models to enhance the quality of synthesized faces.
  • Figure 2: The illustration of the many-to-many relationship between speech and visual modalities. Similar lip shapes correspond to different phonemes, while the same phoneme corresponds to different lip shapes under different emotions.
  • Figure 3: The vocal tract anatomy diagram. The lips are only one of the organs that affect pronunciation.
  • Figure 4: Taking the phonemes /d/ and /t/ as examples to illustrate the phoneme-viseme alignment ambiguity. In the visualization of HuBERT features, there is a noticeable difference between /d/ and /t/. In the visualization of PASE features, there is no significant distinction between /d/ and /t/ because they share similar lip shapes, indicating that PASE differentiates phonemes based on lip shapes rather than acoustic features.
  • Figure 5: The framework of PASE.
  • ...and 2 more figures