DiffusionTalker: Efficient and Compact Speech-Driven 3D Talking Head via Personalizer-Guided Distillation
Peng Chen, Xiaobao Wei, Ming Lu, Hui Chen, Feng Tian
TL;DR
DiffusionTalker targets real-time, personalized, speech-driven 3D facial animation by combining a contrastive personalizer, diffusion-based generation, and distillation for efficiency. The model learns identity and emotion embeddings from audio via contrastive learning, fuses them into a personalized embedding through cross-attention, and uses that embedding to condition a denoising decoder that produces lip-synced, emotionally expressive 3D faces. The key novelty is personalizer-guided distillation: each distillation round halves the number of denoising steps (from $N$ to $n$, with $N=2n$), and repeating the rounds iteratively yields a more than 8x inference speedup. Distillation also compresses a large audio encoder into a compact one, reducing storage by 86.4% with minimal performance loss. An additional personalizer enhancer strengthens the influence of the identity and emotion embeddings on the generated animation. Experiments on BEAT, 3D-ETF, and VOCASET show state-of-the-art lip accuracy (LVE), emotional expressiveness (EVE), and natural facial dynamics (FDD), along with strong zero-shot generalization; ablations confirm the contribution of the personalization components and the distillation strategy. Because identity and emotion embeddings can be extracted quickly from audio, the approach makes real-time, expressive 3D talking heads more accessible for AR/VR and mobile applications.
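The step-halving arithmetic can be sketched as follows. This is only an illustration of the relation $N=2n$ applied repeatedly, not the authors' training code; the starting step count of 16 and the helper names `halving_schedule` and `step_reduction` are assumptions chosen for the example, since the summary does not state the actual step counts.

```python
from typing import List


def halving_schedule(initial_steps: int, rounds: int) -> List[int]:
    """Return the denoising step count before distillation and after each round.

    Each round trains a student that uses half the steps of its teacher,
    i.e. the relation N = 2n from the text, applied once per round.
    """
    schedule = [initial_steps]
    for _ in range(rounds):
        current = schedule[-1]
        if current % 2 != 0:
            raise ValueError(f"cannot halve an odd step count: {current}")
        schedule.append(current // 2)
    return schedule


def step_reduction(initial_steps: int, rounds: int) -> float:
    """Factor by which the number of denoising steps shrinks after all rounds."""
    final_steps = halving_schedule(initial_steps, rounds)[-1]
    return initial_steps / final_steps


# Hypothetical starting point: 16 steps, three halving rounds.
print(halving_schedule(16, 3))
print(step_reduction(16, 3))
```

A single round reduces the step count by a factor of 2, so an 8x reduction in steps takes three rounds. Whether the reported speedup comes from the step count alone or also from the smaller encoder is not stated here, so the sketch models only the step count.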
Abstract
Real-time speech-driven 3D facial animation has attracted interest in both academia and industry. Traditional methods mainly learn a deterministic mapping from speech to animation. Recent approaches have begun to account for the nondeterministic nature of speech-driven 3D facial animation and apply diffusion models to the task. Existing diffusion-based methods can improve the diversity of facial animation. However, they still lack personalized speaking styles with accurate lip movements, and their efficiency and compactness need improvement. In this work, we propose DiffusionTalker to address these limitations via personalizer-guided distillation. For personalization, we introduce a contrastive personalizer that learns identity and emotion embeddings to capture speaking styles from audio. We further propose a personalizer enhancer, applied during distillation, that strengthens the influence of the embeddings on facial animation. For efficiency, we use iterative distillation to reduce the number of steps required for animation generation, achieving more than 8x speedup in inference. For compactness, we distill the large teacher model into a smaller student model, reducing our model's storage by 86.4% while minimizing performance loss. After distillation, users can derive their identity and emotion embeddings from audio to quickly create personalized animations that reflect specific speaking styles. Extensive experiments demonstrate that our method outperforms state-of-the-art methods. The code will be released at: https://github.com/ChenVoid/DiffusionTalker.
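To make the idea of a contrastive personalizer more concrete, the following is a minimal sketch of a generic InfoNCE-style contrastive loss in plain Python. It is not the loss used in DiffusionTalker, whose exact formulation is not given in the abstract. The pairing scheme (audio feature `i` is the positive for identity or emotion embedding `i`), the temperature value, and the function names `cosine` and `info_nce` are all assumptions made for illustration.

```python
import math
from typing import List, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(
    audio_feats: List[Sequence[float]],
    style_embs: List[Sequence[float]],
    temperature: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """Mean cross-entropy over rows, where style_embs[i] is the positive for audio_feats[i].

    All other style embeddings in the batch act as negatives.
    """
    losses = []
    for i, feat in enumerate(audio_feats):
        logits = [cosine(feat, emb) / temperature for emb in style_embs]
        # Log-sum-exp with max subtraction for numerical stability.
        peak = max(logits)
        log_denominator = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)


# Toy batch: two audio features and two candidate style embeddings.
audio = [[1.0, 0.0], [0.0, 1.0]]
matched = [[1.0, 0.0], [0.0, 1.0]]
swapped = [[0.0, 1.0], [1.0, 0.0]]
print(info_nce(audio, matched))  # small: each audio feature aligns with its own embedding
print(info_nce(audio, swapped))  # large: each audio feature aligns with the wrong embedding
```

Minimizing a loss of this kind pulls each audio feature toward its own embedding and pushes it away from the others in the batch, which is the general mechanism by which contrastive training can separate speaking styles.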
