AdaMesh: Personalized Facial Expressions and Head Poses for Adaptive Speech-Driven 3D Facial Animation

Liyang Chen, Weihong Bao, Shun Lei, Boshi Tang, Zhiyong Wu, Shiyin Kang, Haozhi Huang, Helen Meng

TL;DR

AdaMesh tackles personalized speech-driven 3D facial animation by learning an individual's talking style from a short reference video (about 10 seconds) and generating both expressive facial movements and diverse head poses. It introduces two adapters: an expression adapter that is fine-tuned on the reference data with MoLoRA (mixture-of-low-rank adaptation), and a retrieval-based pose adapter that uses a VQ-VAE–PoseGPT pipeline and a semantic-aware pose style matrix to select a pose style without any fine-tuning. The approach uses FLAME for 3D representation and HuBERT-derived speech features, and adapts expressions and poses separately and data-efficiently, achieving strong lip-sync, expressive richness, and pose diversity with limited data. Experiments show AdaMesh outperforms state-of-the-art methods and runs efficiently enough for real-time applications, highlighting its practical impact for personalized avatars in VR, film, and games.

Abstract

Speech-driven 3D facial animation aims to generate facial movements that are synchronized with the driving speech, and has been widely explored recently. Existing works mostly neglect the person-specific talking style in generation, including facial expression and head pose styles. Several works attempt to capture a speaker's style by fine-tuning modules, but the limited training data leads to a lack of vividness. In this work, we propose AdaMesh, a novel adaptive speech-driven facial animation approach, which learns the personalized talking style from a reference video of about 10 seconds and generates vivid facial expressions and head poses. Specifically, we propose mixture-of-low-rank adaptation (MoLoRA) to fine-tune the expression adapter, which efficiently captures the facial expression style. For the personalized pose style, we propose a pose adapter that builds a discrete pose prior and retrieves the appropriate style embedding with a semantic-aware pose style matrix, without fine-tuning. Extensive experimental results show that our approach outperforms state-of-the-art methods, preserves the talking style in the reference video, and generates vivid facial animation. The supplementary video and code will be available at https://adamesh.github.io.

Paper Structure (17 sections, 5 equations, 11 figures, 5 tables, 2 algorithms)

Figures (11)

  • Figure 1: The overview of AdaMesh. The expression adapter and pose adapter generate personalized facial expressions and head poses with the given speech signal and reference talking styles.
  • Figure 2: The overview of the expression adapter. (a) The patches on each encoder and decoder denote that Conformer blocks are used and that MoLoRA parameters are added to the pre-trained modules after adaptation. (b) A brief illustration of the Conformer [conformer_2021]. (c) Illustration of 1D-convolution weights and MoLoRA weights applied to the input features. MoLoRA combines $N$ LoRAs with different rank sizes $r_i$. MoLoRA parameters are added to the convolution and linear layers in the Conformer blocks of the encoders and decoder to efficiently learn the expression style from the reference data.
  • Figure 3: (a) Architecture of the auto-encoder for expression reconstruction. (b) Reconstructed meshes using different rank sizes. (c) Diversity-expression scores for different rank sizes. ①②③ denote rank size 16, 64 and 128.
  • Figure 4: The overview of pose adapter. (a) Training of VQ-VAE. (b) PoseGPT. (c) The derivation of the semantic-aware pose style matrix and the retrieval strategy. In the training of the PoseGPT, the pose style embedding is the assigned one-hot label for each sample. $S$ denotes the semantic-aware pose style matrix. More details about the training and inference can be found in the supplementary materials.
  • Figure 5: Qualitative comparison with different methods. (a) shows lip movements on the Obama dataset with a neutral talking style. (b) shows personalized facial expressions on the emotional MEAD dataset. (c) and (d) show head poses and the corresponding landmark tracemaps on the VoxCeleb2-Test dataset. The first row displays the words or sentences being pronounced in these frames.
  • ...and 6 more figures
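
The Figure 2 caption describes MoLoRA as combining $N$ LoRAs with different rank sizes $r_i$ on top of frozen pre-trained layers. The following is a minimal NumPy sketch of that idea for a single linear layer, not the authors' implementation. It assumes the branches are combined by summing their low-rank updates, and the class name `MoLoRALinear`, the initialization scheme, and the layer sizes are all hypothetical choices made for illustration. Convolution layers, which the caption also mentions, are not covered.

```python
import numpy as np


class MoLoRALinear:
    """Frozen linear layer plus several LoRA branches of different ranks.

    Assumption: the branches are combined by summation; the source only
    states that N LoRAs with rank sizes r_i are combined.
    """

    def __init__(self, weight, ranks, seed=0):
        rng = np.random.default_rng(seed)
        self.weight = weight  # frozen pre-trained weight, shape (d_out, d_in)
        d_out, d_in = weight.shape
        # One (B_i, A_i) pair per rank r_i. B_i starts at zero so the adapted
        # layer initially reproduces the pre-trained layer exactly.
        self.A = [rng.normal(0.0, 0.01, size=(r, d_in)) for r in ranks]
        self.B = [np.zeros((d_out, r)) for r in ranks]

    def delta(self):
        # Sum of low-rank updates B_i @ A_i, each of rank at most r_i.
        return sum(B @ A for A, B in zip(self.A, self.B))

    def num_adapter_params(self):
        # Only A_i and B_i would be trained during adaptation.
        return sum(A.size + B.size for A, B in zip(self.A, self.B))

    def __call__(self, x):
        # x has shape (batch, d_in); output has shape (batch, d_out).
        return x @ (self.weight + self.delta()).T


rng = np.random.default_rng(1)
W = rng.normal(size=(32, 64))              # stand-in for a pre-trained weight
layer = MoLoRALinear(W, ranks=(2, 4, 8))   # illustrative ranks only
x = rng.normal(size=(3, 64))
y = layer(x)
```

In this sketch the adapter holds 14 × (64 + 32) = 1344 values against 2048 in the frozen weight, and the combined update has rank at most 2 + 4 + 8 = 14. This is the general reason low-rank adapters suit short reference clips: few parameters need to be fitted.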