DiffusionTalker: Efficient and Compact Speech-Driven 3D Talking Head via Personalizer-Guided Distillation

Peng Chen, Xiaobao Wei, Ming Lu, Hui Chen, Feng Tian

TL;DR

DiffusionTalker targets real-time, personalized, speech-driven 3D facial animation by combining a contrastive personalizer, diffusion-based generation, and a distillation-based efficiency strategy. The model learns identity and emotion embeddings from audio via contrastive learning, fuses them into a personalized embedding through cross-attention, and uses that embedding to condition a denoising decoder that produces lip-synced, emotionally expressive 3D faces. The key novelty is personalizer-guided distillation, which halves the number of denoising steps in each round (from $N$ to $n$, with $N=2n$) and, applied iteratively, achieves a more than 8x inference speedup. The same process distills a large audio encoder into a compact one, reducing storage by 86.4% with minimal performance loss. An additional personalizer enhancer strengthens the influence of the identity and emotion embeddings on the generated animation. Experiments on BEAT, 3D-ETF, and VOCASET report state-of-the-art lip accuracy (LVE), emotional expressiveness (EVE), and facial dynamics (FDD), along with strong zero-shot generalization; ablations confirm the contribution of the personalization components and the distillation strategy. Because identity and emotion embeddings can be extracted quickly from audio, the approach makes real-time, expressive 3D talking heads more accessible for AR/VR and mobile applications.
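
The step-halving arithmetic can be made concrete with a minimal sketch. The function name `halving_schedule` and the 16-step teacher are hypothetical choices for illustration; the summary only states that each round maps $N$ steps to $n$ with $N=2n$ and that a distilled 2-step model exists. The ratio computed here counts denoising steps only and is not the same thing as a measured wall-clock speedup.

```python
def halving_schedule(teacher_steps: int, target_steps: int) -> list:
    """List the step counts visited when every distillation round halves the steps."""
    if target_steps < 1 or teacher_steps < target_steps:
        raise ValueError("need teacher_steps >= target_steps >= 1")
    schedule = [teacher_steps]
    while schedule[-1] > target_steps:
        if schedule[-1] % 2 != 0:
            raise ValueError("an odd step count cannot be halved exactly")
        schedule.append(schedule[-1] // 2)  # one round: N = 2n -> n
    if schedule[-1] != target_steps:
        raise ValueError("target is not reachable by repeated halving")
    return schedule


# Hypothetical 16-step teacher distilled down to a 2-step student.
schedule = halving_schedule(16, 2)
rounds = len(schedule) - 1                    # number of distillation rounds
step_reduction = schedule[0] / schedule[-1]   # ratio of denoising steps
print(schedule, rounds, step_reduction)
```

Under this assumption, three halving rounds are needed before the step count drops by a factor of eight, which shows why a single halving alone would not account for an 8x speedup.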

Abstract

Real-time speech-driven 3D facial animation has attracted interest in both academia and industry. Traditional methods mainly focus on learning a deterministic mapping from speech to animation. Recent approaches have begun to account for the nondeterministic nature of speech-driven 3D face animation and employ diffusion models for the task. Existing diffusion-based methods improve the diversity of facial animation. However, they still lack personalized speaking styles that convey accurate lip movements, and their efficiency and compactness need improvement. In this work, we propose DiffusionTalker to address these limitations via personalizer-guided distillation. For personalization, we introduce a contrastive personalizer that learns identity and emotion embeddings to capture speaking styles from audio. We further propose a personalizer enhancer, applied during distillation, to strengthen the influence of the embeddings on facial animation. For efficiency, we use iterative distillation to reduce the number of steps required for animation generation, achieving more than 8x speedup in inference. For compactness, we distill the large teacher model into a smaller student model, reducing our model's storage by 86.4% while minimizing performance loss. After distillation, users can derive their identity and emotion embeddings from audio to quickly create personalized animations that reflect specific speaking styles. Extensive experiments demonstrate that our method outperforms state-of-the-art methods. The code will be released at: https://github.com/ChenVoid/DiffusionTalker.
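
To illustrate what a contrastive objective over audio features and style embeddings can look like, here is a minimal sketch of a generic InfoNCE-style loss in plain Python. This is an assumption for illustration, not the loss stated in the paper: the function names, the cosine similarity, the temperature of 0.1, and the toy 2-D vectors are all hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def info_nce(audio_feats, style_embs, temperature=0.1):
    """Mean InfoNCE loss, where style_embs[i] is the positive for audio_feats[i]."""
    losses = []
    for i, audio in enumerate(audio_feats):
        logits = [cosine(audio, emb) / temperature for emb in style_embs]
        peak = max(logits)  # subtract the max for a numerically stable log-sum-exp
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_sum_exp - logits[i])  # -log softmax of the positive pair
    return sum(losses) / len(losses)


# Toy batch: two audio features and two style embeddings (hypothetical values).
audio = [[1.0, 0.0], [0.0, 1.0]]
matched = [[1.0, 0.0], [0.0, 1.0]]   # each embedding aligned with its audio feature
swapped = [[0.0, 1.0], [1.0, 0.0]]   # positives deliberately misaligned

loss_matched = info_nce(audio, matched)
loss_swapped = info_nce(audio, swapped)
print(loss_matched < loss_swapped)
```

The loss is low when each audio feature is most similar to its own style embedding and high when the pairing is wrong, which is the property that lets such an objective pull speaking-style information out of audio.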

Paper Structure

This paper contains 20 sections, 16 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: The illustration of DiffusionTalker. We reduce the steps of the diffusion model for faster inference and compress the model size for compactness by personalizer-guided distillation. Our distilled 2-step model surpasses the state-of-the-art methods in terms of emotional expression and lip accuracy, while also achieving the fastest inference speed and the fewest model parameters.
  • Figure 2: Pipeline of DiffusionTalker. DiffusionTalker employs a contrastive personalizer to extract audio features and personalized embeddings from the input speech. These representations serve as conditioning inputs that guide the motion decoder in denoising noisy facial animations. In the personalizer-guided distillation process, the number of steps in the student model is iteratively reduced to half that of its teacher, significantly accelerating inference. Simultaneously, the model parameters are compressed to create a more compact and efficient model. The personalizer enhancer integrates the personalized embedding with the facial areas of the predicted results, leveraging contrastive learning to strengthen the embedding’s representational capacity.
  • Figure 3: Contrastive Personalizer. Contrastive learning is performed separately on audio features and identity/emotion features, and specific identity/emotion embeddings are fused into personalized embeddings.
  • Figure 4: Personalizer Enhancer. It is used to enhance the personalization.
  • Figure 5: Qualitative comparisons with other methods on BEAT-Test (left), 3D-ETF (middle), and in-the-wild videos (right). We input speech with different identities and emotions into various models and present the same frames to compare them with the ground truth (GT). As indicated by the red box, on the first two datasets our model accurately discerns facial action changes among various identities and precisely generates facial expressions corresponding to specific emotions. Even on in-the-wild videos, our model produces accurate results.
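
The fusion of identity and emotion embeddings through cross-attention, mentioned in the TL;DR and in the Figure 3 caption, can be sketched as plain scaled dot-product attention. Everything specific here is an assumption made for illustration: which embedding acts as the query, the use of several tokens per embedding, the absence of learned projections, and the toy values are not taken from the paper.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries, keys, values):
    """Single-head scaled dot-product attention without learned projections."""
    dim = len(keys[0])
    fused = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Each output token is a weighted average of the value tokens.
        fused.append([
            sum(w * v[j] for w, v in zip(weights, values))
            for j in range(len(values[0]))
        ])
    return fused


# Hypothetical tokens: identity acts as the query, emotion as keys and values.
identity_tokens = [[1.0, 0.0], [0.0, 1.0]]
emotion_tokens = [[1.0, 0.0], [0.0, 1.0]]

personalized = cross_attention(identity_tokens, emotion_tokens, emotion_tokens)
print(personalized)
```

Each fused token leans toward the emotion token that best matches its identity query, so the output mixes information from both embeddings rather than simply concatenating them.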