AdaMesh: Personalized Facial Expressions and Head Poses for Adaptive Speech-Driven 3D Facial Animation
Liyang Chen, Weihong Bao, Shun Lei, Boshi Tang, Zhiyong Wu, Shiyin Kang, Haozhi Huang, Helen Meng
TL;DR
AdaMesh tackles personalized speech-driven 3D facial animation: it learns an individual's talking style from a short reference video and generates both expressive facial movements and diverse head poses. It introduces two adapters. The expression adapter is fine-tuned with mixture-of-low-rank adaptation (MoLoRA) to capture the facial expression style. The pose adapter requires no fine-tuning; it builds on a VQ-VAE and PoseGPT pipeline and uses a semantic-aware pose style matrix to retrieve a style embedding, which yields semantically aligned poses. The approach uses FLAME for 3D representation and HuBERT-derived speech features, and it adapts expressions and poses separately and data-efficiently, achieving strong lip-sync, expressive richness, and pose diversity with limited data. Experiments show AdaMesh outperforms state-of-the-art methods and runs efficiently enough for real-time applications, which makes it practical for personalized avatars in VR, film, and games.
Abstract
Speech-driven 3D facial animation, which has been widely explored recently, aims to generate facial movements that are synchronized with the driving speech. Existing works mostly neglect the person-specific talking style in generation, including facial expression and head pose styles. Several works attempt to capture personal style by fine-tuning modules, but limited training data leaves the results lacking vividness. In this work, we propose AdaMesh, a novel adaptive speech-driven facial animation approach, which learns the personalized talking style from a reference video of about 10 seconds and generates vivid facial expressions and head poses. Specifically, we propose mixture-of-low-rank adaptation (MoLoRA) to fine-tune the expression adapter, which efficiently captures the facial expression style. For the personalized pose style, we propose a pose adapter that builds a discrete pose prior and retrieves the appropriate style embedding with a semantic-aware pose style matrix, without fine-tuning. Extensive experimental results show that our approach outperforms state-of-the-art methods, preserves the talking style in the reference video, and generates vivid facial animation. The supplementary video and code will be available at https://adamesh.github.io.
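The abstract names MoLoRA but does not give its formulation, so the following is only a minimal sketch of one plausible reading: a frozen linear layer whose weight is adjusted by the sum of several low-rank (LoRA-style) updates of different ranks. The class names `LoRABranch` and `MoLoRALinear`, the choice of ranks, the summation of branches, and the zero initialization of `B` are all assumptions made for illustration and are not taken from the paper.

```python
import random


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRABranch:
    """One low-rank update B @ A (hypothetical helper, not the authors' code)."""

    def __init__(self, d_in, d_out, rank, rng):
        # A has shape (rank, d_in) and starts with small random values.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        # B has shape (d_out, rank) and starts at zero, so the update is initially zero.
        self.B = [[0.0] * rank for _ in range(d_out)]

    def delta(self):
        return matmul(self.B, self.A)  # shape (d_out, d_in)


class MoLoRALinear:
    """Frozen weight plus a sum of low-rank branches with different ranks (assumed form)."""

    def __init__(self, weight, ranks, seed=0):
        self.weight = weight  # frozen, shape (d_out, d_in)
        d_out, d_in = len(weight), len(weight[0])
        rng = random.Random(seed)
        self.branches = [LoRABranch(d_in, d_out, r, rng) for r in ranks]

    def effective_weight(self):
        w = [row[:] for row in self.weight]
        for branch in self.branches:
            d = branch.delta()
            for i, row in enumerate(d):
                for j, value in enumerate(row):
                    w[i][j] += value
        return w

    def forward(self, x):
        """Apply the adapted layer to a single input vector x of length d_in."""
        return [sum(wi * xi for wi, xi in zip(row, x)) for row in self.effective_weight()]

    def num_trainable(self):
        """Count adapter parameters only; the base weight stays frozen."""
        return sum(
            len(b.A) * len(b.A[0]) + len(b.B) * len(b.B[0]) for b in self.branches
        )


identity = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]
layer = MoLoRALinear(identity, ranks=(1, 2))
base_out = layer.forward([1.0, 2.0, 3.0, 4.0])
n_params = layer.num_trainable()
```

Under these assumptions, only the `A` and `B` matrices would be updated on the short reference video, which keeps the number of tuned parameters small relative to the frozen base weight.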
