Meta-Learning Empowered Meta-Face: Personalized Speaking Style Adaptation for Audio-Driven 3D Talking Face Animation

Xukun Zhou, Fengxin Li, Ziqiao Peng, Kejian Wu, Jun He, Biao Qin, Zhaoxin Fan, Hongyan Liu

TL;DR

MetaFace tackles the challenge of adapting audio-driven 3D talking face animation to varied speaking styles with minimal data. It introduces a meta-learning-based architecture comprising a Robust Meta Initialization Stage (RMIS), a Dynamic Relation Mining Neural Process (DRMN), and a Low-rank Matrix Memory Reduction mechanism to enable fast, memory-efficient personalization. On VOCASet and BIWI, MetaFace achieves state-of-the-art performance in both facial motion and lip synchronization, and ablation studies show that each component contributes to the result. The work offers a practical path toward personalized, high-fidelity avatars for live streaming and AR applications, reducing data requirements and computational load for on-device adaptation.

Abstract

Audio-driven 3D face animation is increasingly vital in live streaming and augmented reality applications. Despite remarkable progress, most existing approaches are designed for specific individuals with predefined speaking styles and therefore neglect adaptability to varied speaking styles. To address this limitation, this paper introduces MetaFace, a method designed for speaking style adaptation. Grounded in meta-learning, MetaFace is composed of several key components: the Robust Meta Initialization Stage (RMIS) for fundamental speaking style adaptation, the Dynamic Relation Mining Neural Process (DRMN) for forging connections between observed and unobserved speaking styles, and the Low-rank Matrix Memory Reduction Approach to improve the efficiency of model optimization and the learning of style details. With these designs, MetaFace significantly outperforms strong existing baselines and establishes a new state of the art, as substantiated by the experimental results.
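
The abstract names a Low-rank Matrix Memory Reduction Approach but does not spell out its formulation here. As a rough illustration of why a low-rank parameterization saves memory during per-speaker adaptation, the sketch below uses a generic low-rank update (a frozen weight `W` plus a trainable product `B A`). This is an assumption for illustration, not the paper's actual method; the function names `init_low_rank`, `adapted_forward`, and `trainable_params` and the layer sizes are hypothetical.

```python
import random


def matvec(M, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m_ij * x_j for m_ij, x_j in zip(row, x)) for row in M]


def init_low_rank(d_out, d_in, rank, seed=0):
    """Return factors (B, A) of a rank-`rank` update delta_W = B A.

    B starts at zero, so the adapted layer initially matches the frozen one.
    (Hypothetical helper; the paper's initialization may differ.)
    """
    rng = random.Random(seed)
    A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
    B = [[0.0] * rank for _ in range(d_out)]
    return B, A


def adapted_forward(W, B, A, x):
    """Compute (W + B A) x without materializing the d_out x d_in update."""
    base = matvec(W, x)
    delta = matvec(B, matvec(A, x))
    return [b + d for b, d in zip(base, delta)]


def trainable_params(d_out, d_in, rank):
    """Number of per-speaker parameters stored by the low-rank update."""
    return rank * (d_in + d_out)


# Assumed sizes: a 64 x 64 layer adapted with rank 4.
d_out, d_in, rank = 64, 64, 4
print(d_out * d_in, trainable_params(d_out, d_in, rank))  # 4096 512
```

Under these assumed sizes, only the two small factors need to be optimized and stored per speaking style, while the shared weights stay fixed.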

Paper Structure

This paper contains 20 sections, 14 equations, 3 figures, 4 tables, 1 algorithm.

Figures (3)

  • Figure 1: Difference between speaking style adaptation mode of existing methods and MetaFace.
  • Figure 2: Overall Framework of MetaFace.
  • Figure 3: Visual comparisons of facial movement by different methods on VOCA-Test (left) and BIWI-Test-A (right).