Learn2Talk: 3D Talking Face Learns from 2D Talking Face
Yixiang Zhuang, Baoping Cheng, Yao Cheng, Yuntao Jin, Renshuai Liu, Chengyang Li, Xuan Cheng, Jing Liao, Juncong Lin
TL;DR
Learn2Talk addresses the gap between 2D and 3D speech-driven facial animation by introducing a 3D lip-sync expert (SyncNet3D) and a teacher-guided lipreading constraint to supervise a 3D audio-to-motion regressor. The framework couples a transformer-based student with a 2D teacher through differentiable rendering and a lipreading network, and adds an optional PoseVAE head-motion module. It improves lip-sync and 3D vertex accuracy and enables applications in audio-visual speech recognition (AVSR) and 3D Gaussian Splatting-based avatars. Empirical results on BIWI and VOCASET show state-of-the-art lip-sync metrics and robust vertex quality, with ablations highlighting the complementary roles of the 3D sync loss and the lipreading constraint. The work offers a practical path to more natural, expressive 3D talking-face avatars and to broader use in AVSR and immersive systems, and it identifies trade-offs that motivate future work on multi-task balancing and on additional expressive cues such as gaze and emotion.
Abstract
Speech-driven facial animation methods fall into two main classes, 3D and 2D talking face, both of which have attracted considerable research attention in recent years. However, to the best of our knowledge, research on 3D talking face has not gone as deep as research on 2D talking face in terms of lip-synchronization (lip-sync) and speech perception. To bridge the gap between the two sub-fields, we propose a learning framework named Learn2Talk, which constructs a better 3D talking face network by exploiting two kinds of expertise from the field of 2D talking face. First, inspired by the audio-video sync network, a 3D lip-sync expert model is devised to pursue lip-sync between audio and 3D facial motion. Second, a teacher model selected from 2D talking face methods is used to guide the training of the audio-to-3D-motion regression network to achieve higher 3D vertex accuracy. Extensive experiments show the advantages of the proposed framework in terms of lip-sync, vertex accuracy and speech perception, compared with state-of-the-art methods. Finally, we show two applications of the proposed framework: audio-visual speech recognition and speech-driven 3D Gaussian Splatting-based avatar animation.
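
To make the idea of a lip-sync expert concrete, the following is a minimal sketch of one common way such a score can be turned into a training loss: embed an audio window and a motion window, compare them with cosine similarity, and penalize the result with binary cross-entropy. The abstract does not specify the loss used by the 3D expert, so everything here is an assumption for illustration: the function names `cosine_similarity` and `sync_loss`, the toy 2-D embeddings, and the mapping of similarity from [-1, 1] to a probability are all hypothetical, not the authors' formulation.

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + 1e-8)  # small constant avoids division by zero

def sync_loss(audio_emb, motion_emb, in_sync, eps=1e-7):
    """Binary cross-entropy on a similarity-derived sync probability.

    Assumption: similarity in [-1, 1] is mapped linearly to (0, 1).
    `in_sync` is True for an aligned audio/motion pair, False for a shifted one.
    """
    sim = cosine_similarity(audio_emb, motion_emb)
    p = min(max((sim + 1.0) / 2.0, eps), 1.0 - eps)  # clamp for a finite log
    target = 1.0 if in_sync else 0.0
    return -(target * math.log(p) + (1.0 - target) * math.log(1.0 - p))

# Toy embeddings (hypothetical): identical vectors vs. orthogonal vectors.
aligned = sync_loss([1.0, 0.0], [1.0, 0.0], in_sync=True)
misaligned = sync_loss([1.0, 0.0], [0.0, 1.0], in_sync=True)
```

In this sketch an in-sync pair whose embeddings point the same way incurs almost no loss, while an in-sync pair with orthogonal embeddings is penalized, which is the kind of signal a sync expert could provide to the audio-to-motion regressor during training.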
