StyleSpeaker: Audio-Enhanced Fine-Grained Style Modeling for Speech-Driven 3D Facial Animation
An Yang, Chenyu Liu, Pengcheng Xia, Jun Du
TL;DR
StyleSpeaker addresses fine-grained, style-preserving speech-driven 3D facial animation under diverse speaking styles and limited 3D data. It introduces an audio-conditioned style encoder and a style infusion module that uses a set of style primitives to build a robust, decomposable style space, enabling rapid adaptation to unseen speakers without fine-tuning. Training adds a trend loss and a local contrastive loss to improve temporal coherence and lip synchronization, and a Fourier Frequency Error metric is introduced to evaluate style consistency. Across BIWI, VOCASET, and 3D-MEAD, StyleSpeaker outperforms state-of-the-art baselines in lip accuracy, overall motion fidelity, and style consistency, demonstrating strong generalization and practical potential for production pipelines.
Abstract
Speech-driven 3D facial animation is challenging due to the diversity of speaking styles and the limited availability of 3D audio-visual data. Speech predominantly dictates the coarse motion trends of the lip region, while a speaker's specific style determines the details of lip motion and the overall facial expressions. Prior works lack fine-grained style modeling and do not adequately consider style biases across varying speech conditions, which reduces the accuracy of style modeling and hampers adaptation to unseen speakers. To address this, we propose a novel framework, StyleSpeaker, which explicitly extracts speaking styles based on speaker characteristics while accounting for the style biases caused by different speech inputs. Specifically, we use a style encoder to capture speakers' styles from facial motions and enhance them according to the motion preferences elicited by varying speech conditions. The enhanced styles are then integrated into the coarse motion features via a style infusion module, which employs a set of style primitives to learn a fine-grained style representation. Throughout training, we maintain this set of style primitives to comprehensively model the entire style space. Hence, StyleSpeaker has robust style modeling capability for seen speakers and can rapidly adapt to unseen speakers without fine-tuning. Additionally, we design a trend loss and a local contrastive loss to improve the synchronization between synthesized motions and speech. Extensive qualitative and quantitative experiments on three public datasets demonstrate that our method outperforms existing state-of-the-art approaches.
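To make two of the ideas above more concrete, here is a minimal sketch in plain Python. It is not the StyleSpeaker implementation: the abstract does not specify how the style primitives are combined or how the trend loss is defined, so both functions are assumptions made for illustration. `mix_style_primitives` assumes a style vector attends over a bank of primitives with scaled dot-product weights, and `trend_loss` assumes the loss compares frame-to-frame differences of a predicted and a reference motion track. All names and shapes are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def mix_style_primitives(style_query, primitives):
    """Blend a bank of style primitives with attention weights.

    ASSUMPTION: scaled dot-product attention is used here only as one
    plausible way to combine primitives; the source does not say so.

    style_query: list of d floats (hypothetical enhanced style vector)
    primitives:  list of K lists of d floats (hypothetical primitive bank)
    Returns (weights, blended_style).
    """
    d = len(style_query)
    scores = [
        sum(q * p for q, p in zip(style_query, prim)) / math.sqrt(d)
        for prim in primitives
    ]
    weights = softmax(scores)
    blended = [
        sum(w * prim[i] for w, prim in zip(weights, primitives))
        for i in range(d)
    ]
    return weights, blended


def trend_loss(pred, target):
    """Mean squared error between frame-to-frame differences.

    ASSUMPTION: this temporal-difference form is an illustrative guess at
    what a "trend" loss could measure, not the paper's definition.

    pred, target: lists of floats of equal length (one motion coordinate
    over time), with at least two frames.
    """
    d_pred = [b - a for a, b in zip(pred, pred[1:])]
    d_target = [b - a for a, b in zip(target, target[1:])]
    return sum((x - y) ** 2 for x, y in zip(d_pred, d_target)) / len(d_pred)
```

Under these assumptions, an unseen speaker's style is expressed as a weighted mix of primitives already in the bank, which is one way to read "adapt to unseen speakers without fine-tuning". The difference-based loss ignores a constant offset between two tracks and penalizes only mismatched motion direction and speed.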
