KMTalk: Speech-Driven 3D Facial Animation with Key Motion Embedding

Zhihao Xu, Shengjie Gong, Jiapeng Tang, Lingyu Liang, Yining Huang, Haojie Li, Shuangping Huang

TL;DR

KMTalk tackles the ill-posed problem of mapping audio to 3D facial motion with a progressive key-motion embedding framework that combines linguistic priors with data-driven interpolation. The method reduces cross-modal mapping uncertainty by first predicting key motions at frames selected through phoneme-based localization, then expanding them into full sequences with a cross-modal motion completion module guided by audio features. Results on BIWI and VOCASET show better lip synchronization and more dynamic facial motion than state-of-the-art baselines, and ablations confirm the contribution of each component. The learning scheme also transfers to existing methods, consistently improving their results when integrated, and offers a practical path toward more realistic and temporally coherent talking faces.

Abstract

We present a novel approach for synthesizing 3D facial motions from audio sequences using key motion embeddings. Despite recent advances in data-driven techniques, accurately mapping audio signals to 3D facial meshes remains challenging. Direct regression of the entire sequence often leads to over-smoothed results due to the ill-posed nature of the problem. To this end, we propose a progressive learning mechanism that generates 3D facial animations by introducing key motion capture to decrease cross-modal mapping uncertainty and learning complexity. Concretely, our method integrates linguistic and data-driven priors through two modules: linguistic-based key motion acquisition and cross-modal motion completion. The former identifies key motions and learns the associated 3D facial expressions, ensuring accurate lip-speech synchronization. The latter extends key motions into a full sequence of 3D talking faces guided by audio features, improving temporal coherence and audio-visual consistency. Extensive experimental comparisons against existing state-of-the-art methods demonstrate the superiority of our approach in generating more vivid and consistent talking face animations. Integrating our proposed learning scheme into existing methods consistently improves their results, which further underscores the efficacy of our approach. Our code and weights will be available at the project website: https://github.com/ffxzh/KMTalk.
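
The over-smoothing claim in the abstract can be made concrete with a toy example. The sketch below is not taken from the paper: it assumes two hypothetical, equally plausible lip-opening curves for the same audio clip (the names `motion_a` and `motion_b` and all numbers are illustrative) and relies only on the standard fact that the single prediction minimizing expected squared error is the mean of the targets.

```python
import numpy as np

# Toy illustration (assumption, not the paper's data): two equally plausible
# lip-opening curves for the same audio, differing only in timing.
t = np.linspace(0.0, 1.0, 101)
motion_a = np.exp(-((t - 0.4) ** 2) / (2 * 0.05 ** 2))  # opening peaks near t = 0.4
motion_b = np.exp(-((t - 0.6) ** 2) / (2 * 0.05 ** 2))  # opening peaks near t = 0.6

# Under a squared-error loss, the best single prediction is the mean target.
mse_optimal = 0.5 * (motion_a + motion_b)

def expected_mse(pred):
    """Expected squared error when either target is equally likely."""
    err_a = np.mean((pred - motion_a) ** 2)
    err_b = np.mean((pred - motion_b) ** 2)
    return 0.5 * (err_a + err_b)

peak_plausible = motion_a.max()     # amplitude of a realistic motion
peak_regressed = mse_optimal.max()  # amplitude of the regressed average

print(f"peak of a plausible motion: {peak_plausible:.3f}")
print(f"peak of the MSE-optimal prediction: {peak_regressed:.3f}")
```

The averaged curve has a lower expected error than either plausible curve, yet its peak is much smaller, which is the flattened mouth motion that direct regression tends to produce. Fixing a few key frames first, as the abstract proposes, narrows the set of sequences that remain consistent with the audio.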

Paper Structure (28 sections, 6 equations, 11 figures, 9 tables)

Figures (11)

  • Figure 1: Compared to the state-of-the-art method SelfTalk, our approach produces more vivid lip motions from speech, since we introduce linguistic priors to characterize key motions and use data-driven priors to interpolate non-key motions.
  • Figure 2: The first panel illustrates the overall pipeline of our proposed KMTalk. Initially, the Audio Encoder takes the input raw audio $\mathbf{x}$ and encodes it into audio features $\mathbf{A}$. Subsequently, in the LKMA module, key motions $\mathbf{K}$ are generated from the audio $\mathbf{x}$ and $\mathbf{A}$. Finally, the CMC module reintroduces audio features $\mathbf{A}$ to extend these key motions $\mathbf{K}$ into a full sequence $\mathbf{Y}$. The second panel presents the details of the two key modules in KMTalk. In the Linguistic-based Key Motion Acquisition, a Phoneme-based Localization Method is used to identify key motion indices $\mathcal{I}$ from raw audio $\mathbf{x}$. Based on audio features $\mathbf{A}$ and $\mathcal{I}$, the Key Motion-focused Decoder generates key motions $\mathbf{K}$. In the Cross-modal Motion Completion, the Motion Flow Encoder processes $\mathbf{K}$ and $\mathcal{I}$, producing motion flow features $\mathbf{\Phi}$. Then, with the dynamic fusion weight $\mathbf{G}$, the Multimodal-Guided Decoder combines $\mathbf{\Phi}$ and $\mathbf{A}$ to decode the final motion sequence $\mathbf{Y}$.
  • Figure 3: Qualitative comparisons on VOCA-Test (left) and BIWI-Test-B (right). We provide visual comparisons of facial animations synchronized with six syllables extracted from the test speech sequences. The 1st, 3rd, and 5th rows display synthesized meshes and their corresponding ground truths, while the 2nd, 4th, and 6th rows visualize the L2 loss for individual frames. Our method produces more precise mouth movement on syllables like /æ/ that require a wide-open mouth. For syllables that start with a closed mouth and then open slightly, such as /bI/, our KMTalk generates visually better synchronized motion sequences. The last row visualizes the mean squared error of each method across all sentences in the test set for a specific subject.
  • Figure 4: Qualitative ablation studies on the input speech "specifically". For each method variant, we removed one of three modules: PLM (Phoneme-based Localization Method), KMD (Key Motion-focused Decoder), and AG (Audio Guidance in CMC). Error maps between the generated and ground-truth mesh sequences are visualized. Our final model yields the best results, showing the effectiveness of each module.
  • Figure 5: Robustness analysis of Phoneme-based Localization on BIWI-Test-A.
  • ...and 6 more figures
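
The data flow described in the Figure 2 caption can be traced with a small NumPy sketch. Only the symbols $\mathbf{x}$, $\mathbf{A}$, $\mathcal{I}$, $\mathbf{K}$, $\mathbf{\Phi}$, $\mathbf{G}$, and $\mathbf{Y}$ come from the caption. Every function name, shape, and internal operation below is a hypothetical stand-in: linear interpolation replaces the learned Motion Flow Encoder, a convex blend replaces the learned Multimodal-Guided Decoder, and the key indices are hard-coded instead of being derived from phonemes. It shows the order of operations and tensor shapes, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)
T, D_AUDIO, D_MOTION = 12, 8, 6  # frames, audio dim, motion dim (arbitrary)

def audio_encoder(x):
    """Stand-in: raw audio x -> per-frame audio features A, shape (T, D_AUDIO)."""
    return x.reshape(T, -1)[:, :D_AUDIO]

def phoneme_localization(x):
    """Stand-in: raw audio x -> sorted key-motion frame indices I (hard-coded)."""
    return np.array([0, 4, 8, 11])

def key_motion_decoder(A, I, W):
    """Stand-in: decode key motions K only at the frames listed in I."""
    return A[I] @ W

def motion_flow_encoder(K, I):
    """Stand-in: spread K over all T frames by linear interpolation -> Phi."""
    frames = np.arange(T)
    columns = [np.interp(frames, I, K[:, d]) for d in range(K.shape[1])]
    return np.stack(columns, axis=1)

def multimodal_decoder(Phi, A, W, G):
    """Stand-in: blend motion flow and audio-driven motion with weight G in [0, 1]."""
    return G * Phi + (1.0 - G) * (A @ W)

x = rng.standard_normal(T * 16)               # fake raw audio, 16 samples per frame
W = rng.standard_normal((D_AUDIO, D_MOTION))  # fake projection shared by both decoders
A = audio_encoder(x)                          # audio features
I = phoneme_localization(x)                   # key motion indices
K = key_motion_decoder(A, I, W)               # key motions
Phi = motion_flow_encoder(K, I)               # motion flow features
G = np.full((T, 1), 0.7)                      # constant here; dynamic in the caption
Y = multimodal_decoder(Phi, A, W, G)          # full motion sequence

print(A.shape, K.shape, Phi.shape, Y.shape)
```

In this simplified version the output coincides with the key motions at the frames in $\mathcal{I}$, because both blended terms agree there. That is a property of the stand-ins chosen for the sketch, not a claim about the trained model.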