PASE: Phoneme-Aware Speech Encoder to Improve Lip Sync Accuracy for Talking Head Synthesis
Yihuan Huang, Jiajun Liu, Yanzhen Ren, Jun Xue, Wuyang Liu, Zongkun Sun
TL;DR
This work tackles phoneme-viseme alignment ambiguity in talking head synthesis by introducing PASE, a phoneme-aware speech encoder that can be plugged into existing NeRF- or 3DGS-based renderers. PASE combines an STFT-based audio pathway with a GRU, a multi-channel CNN for lip visuals, and a phoneme-level alignment module that fuses audio and visual features via cross-attention and contrastive learning. A prediction and reconstruction task is added for robustness. Empirical results show state-of-the-art lip-sync accuracy, with gains of 13.7% and 14.2% in lip-sync metrics over baselines that use conventional acoustic features on NeRF and 3DGS pipelines, respectively, while maintaining efficient inference. The encoder is designed to integrate into diverse pipelines without architectural changes to the rendering models.
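To make the cross-attention fusion step concrete, the following is a minimal sketch and not the authors' implementation. It assumes phoneme embeddings act as queries that attend over per-frame audio features (keys and values), with a single head and no learned projections. The names `cross_attention`, `phoneme_emb`, and `audio_frames` are hypothetical, and the toy vectors are made up for illustration.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Scaled dot-product attention: each query row attends over all key/value rows.

    Returns (outputs, weights), where weights[i][j] is how strongly
    query i attends to key j.
    """
    d = len(keys[0])
    outputs, weights = [], []
    for q in queries:
        # Similarity of this query to every key, scaled by sqrt(feature dim).
        scores = [sum(qi * ki for qi, ki in zip(q, k)) / math.sqrt(d) for k in keys]
        w = softmax(scores)
        # Weighted sum of value rows.
        out = [sum(wj * v[c] for wj, v in zip(w, values)) for c in range(len(values[0]))]
        outputs.append(out)
        weights.append(w)
    return outputs, weights


# Toy data (assumed): 2 phoneme anchors, 3 audio frames, feature dim 2.
phoneme_emb = [[1.0, 0.0], [0.0, 1.0]]
audio_frames = [[4.0, 0.0], [0.0, 4.0], [2.0, 2.0]]

fused, attn = cross_attention(phoneme_emb, audio_frames, audio_frames)
```

In this toy setup each phoneme anchor places most of its attention on the audio frame that points in the same direction, which is the intuition behind using phonemes as alignment anchors. A real model would add learned query, key, and value projections and multiple heads.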
Abstract
Recent talking head synthesis works typically adopt speech features extracted from large-scale pre-trained acoustic models. However, the intrinsic many-to-many relationship between speech and lip motion causes phoneme-viseme alignment ambiguity, leading to inaccurate and unstable lip motion. To further improve lip sync accuracy, we propose PASE (Phoneme-Aware Speech Encoder), a novel speech representation model that bridges the gap between phonemes and visemes. PASE explicitly introduces phoneme embeddings as alignment anchors and employs a contrastive alignment module to enhance the discriminability of corresponding audio-visual pairs. In addition, a prediction and reconstruction task is designed to improve robustness under noise and partial modality absence. Experimental results show that PASE significantly improves lip sync accuracy and achieves state-of-the-art performance across both NeRF- and 3DGS-based rendering frameworks, outperforming conventional methods based on acoustic features by 13.7% and 14.2%, respectively. Importantly, PASE can be seamlessly integrated into diverse talking head pipelines to improve lip sync accuracy without architectural modifications.
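The abstract does not specify the contrastive objective, so the following is only an illustrative sketch of one common choice, a symmetric InfoNCE loss over paired audio and visual embeddings. The function name `info_nce`, the temperature of 0.07, and the toy embeddings are assumptions, not values from the paper.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def info_nce(audio, visual, temperature=0.07):
    """Symmetric InfoNCE: row i of `audio` and row i of `visual` form a positive pair;
    every other row in the batch serves as a negative."""
    a = [l2_normalize(v) for v in audio]
    b = [l2_normalize(v) for v in visual]
    n = len(a)
    # Cosine similarity matrix scaled by temperature.
    logits = [
        [sum(x * y for x, y in zip(a[i], b[j])) / temperature for j in range(n)]
        for i in range(n)
    ]

    def diagonal_cross_entropy(rows):
        # Cross-entropy where the correct class for row i is column i.
        loss = 0.0
        for i, row in enumerate(rows):
            m = max(row)
            log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
            loss += log_sum_exp - row[i]
        return loss / len(rows)

    transposed = [list(col) for col in zip(*logits)]
    # Average the audio-to-visual and visual-to-audio directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(transposed))


# Toy embeddings (assumed): two audio-visual pairs, feature dim 2.
audio_emb = [[1.0, 0.0], [0.0, 1.0]]
visual_emb = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = info_nce(audio_emb, visual_emb)
mismatched_loss = info_nce(audio_emb, visual_emb[::-1])
```

The loss is lower when each audio embedding sits closest to its own visual partner than when the pairs are swapped, which is the sense in which a contrastive objective sharpens the separation between corresponding and non-corresponding pairs.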
