
Learn2Talk: 3D Talking Face Learns from 2D Talking Face

Yixiang Zhuang, Baoping Cheng, Yao Cheng, Yuntao Jin, Renshuai Liu, Chengyang Li, Xuan Cheng, Jing Liao, Juncong Lin

TL;DR

Learn2Talk addresses the gap between 2D and 3D speech-driven facial animation by introducing a 3D lip-sync expert (SyncNet3D) and a teacher-guided lipreading constraint to supervise a 3D audio-to-motion regressor. The framework couples a transformer-based student with a 2D teacher through differentiable rendering and a lipreading network, and adds an optional PoseVAE head motion module. It improves lip-sync and 3D vertex accuracy while enabling applications in audio-visual speech recognition (AVSR) and 3D Gaussian Splatting-based avatars. Empirical results on BIWI and VOCASET show state-of-the-art lip-sync metrics and robust vertex quality, with ablations highlighting the complementary roles of the 3D sync loss and the lipreading constraint. This work provides a practical path to more natural, expressive 3D talking-face avatars and broader applications in AVSR and immersive systems, while identifying trade-offs that motivate future multi-task balancing and additional expressive cues such as gaze and emotion.

Abstract

Speech-driven facial animation methods fall into two main classes, 3D and 2D talking face, both of which have attracted considerable research attention in recent years. However, to the best of our knowledge, research on 3D talking face has not gone as deep as research on 2D talking face in terms of lip-synchronization (lip-sync) and speech perception. To bridge the gap between the two sub-fields, we propose a learning framework named Learn2Talk, which constructs a better 3D talking face network by exploiting two points of expertise from the field of 2D talking face. First, inspired by the audio-video sync network, a 3D lip-sync expert model is devised to pursue lip-sync between audio and 3D facial motion. Second, a teacher model selected from 2D talking face methods is used to guide the training of the audio-to-3D-motion regression network to achieve higher 3D vertex accuracy. Extensive experiments show the advantages of the proposed framework in terms of lip-sync, vertex accuracy and speech perception, compared with state-of-the-art methods. Finally, we show two applications of the proposed framework: audio-visual speech recognition and speech-driven 3D Gaussian Splatting based avatar animation.
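
To make the idea of a lip-sync expert more concrete, the sketch below shows a generic contrastive loss of the kind used by audio-video sync networks: genuine (in-sync) audio/motion pairs are pulled together in embedding space, and false (off-sync) pairs are pushed at least a margin apart under the $L_2$ distance. This is only an illustration under stated assumptions. The abstract does not specify the loss or architecture of the 3D expert, so the function names, the margin value, and the toy embeddings are all hypothetical.

```python
import math


def l2_distance(a, b):
    """Euclidean distance between two equal-length embedding vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def contrastive_sync_loss(audio_emb, motion_emb, is_genuine, margin=1.0):
    """Generic contrastive loss for one audio/3D-motion pair.

    Hypothetical helper, not the paper's formulation:
    - genuine pair: penalize any distance (pull embeddings together)
    - false pair:   penalize only distances below `margin` (push apart)
    """
    d = l2_distance(audio_emb, motion_emb)
    if is_genuine:
        return d ** 2
    return max(0.0, margin - d) ** 2


# Toy embeddings (made-up numbers, for illustration only).
audio = [0.2, 0.9, -0.1]
motion_in_sync = [0.25, 0.85, -0.05]   # close to the audio embedding
motion_shifted = [-0.7, 0.1, 0.6]      # far from the audio embedding

loss_genuine = contrastive_sync_loss(audio, motion_in_sync, is_genuine=True)
loss_false = contrastive_sync_loss(audio, motion_shifted, is_genuine=False)
```

In this toy setup the in-sync pair incurs a small loss because its distance is small, while the shifted pair already lies beyond the margin and contributes no loss. A trained expert of this kind would be expected to separate the distance distributions of genuine and false pairs.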

Paper Structure (21 sections, 8 equations, 9 figures, 6 tables)

This paper contains 21 sections, 8 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Given speech audio, our method can synthesize facial motion and head motion, and drive a 3D Gaussian Splatting [3DGSTOG2023] based head avatar.
  • Figure 2: Concept diagram of Learn2Talk framework.
  • Figure 3: The pipeline of the proposed Learn2Talk framework. In training, all modules are used. In inference, only the student model is used, which consists of the embedding layer, transformer decoder and speech encoder.
  • Figure 4: The static image (a) is used in one-shot 2D talking face methods to generate animated face (d). The facial textures (b)(c) are used in rendering 3D facial motions to videos (e)(f).
  • Figure 5: The distribution of $L_2$ distances for genuine and false audio-3D motion pairs.
  • ...and 4 more figures