DiffSpeaker: Speech-Driven 3D Facial Animation with Diffusion Transformer

Zhiyuan Ma, Xiangyu Zhu, Guojun Qi, Chen Qian, Zhaoxiang Zhang, Zhen Lei

TL;DR

The paper tackles speech-driven 3D facial animation under limited audio-4D data by marrying diffusion models with a Transformer architecture. It introduces DiffSpeaker, a diffusion-based Transformer that uses biased conditional self/cross-attention and condition tokens for audio, speaking style, and diffusion steps to enable parallel, data-efficient generation. Empirical results on BIWI and VOCASET show state-of-the-art lip synchronization and expressive dynamics, with ablations underscoring the necessity of cross- and self-attention biases and the conditioning scheme. The approach achieves fast inference relative to prior diffusion-based methods, making it practically appealing for real-time or interactive applications.

Abstract

Speech-driven 3D facial animation is important for many multimedia applications. Recent work has shown promise in using either Diffusion models or Transformer architectures for this task. However, their mere aggregation does not lead to improved performance. We suspect this is due to a shortage of paired audio-4D data, which is crucial for the Transformer to effectively perform as a denoiser within the Diffusion framework. To tackle this issue, we present DiffSpeaker, a Transformer-based network equipped with novel biased conditional attention modules. These modules serve as substitutes for the traditional self/cross-attention in standard Transformers, incorporating thoughtfully designed biases that steer the attention mechanisms to concentrate on both the relevant task-specific and diffusion-related conditions. We also explore the trade-off between accurate lip synchronization and non-verbal facial expressions within the Diffusion paradigm. Experiments show our model not only achieves state-of-the-art performance on existing benchmarks, but also fast inference speed owing to its ability to generate facial motions in parallel.
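
To make the idea of a biased attention module concrete, the following is a minimal sketch under stated assumptions, not the authors' implementation. It assumes that condition tokens (for example, encodings of speaking style and diffusion step) are prepended to the keys and values, and that a static additive bias is applied to the attention logits before the softmax. The function names, the `slope` parameter, and the distance-based penalty are hypothetical choices made for illustration; the abstract does not specify the form of the bias.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def make_bias(num_frames, num_cond, slope=1.0):
    """Hypothetical static bias (an assumption, not taken from the paper).

    Columns for the condition tokens get zero bias, so every frame can
    attend to them freely. Columns for motion frames get a penalty that
    grows with temporal distance |i - j|.
    """
    return [
        [0.0] * num_cond + [-slope * abs(i - j) for j in range(num_frames)]
        for i in range(num_frames)
    ]


def biased_attention(queries, keys, values, bias):
    """Scaled dot-product attention with an additive bias on the logits.

    queries: one vector per frame; keys/values: condition tokens followed
    by frame tokens; bias: one row per query, one column per key.
    Returns (outputs, attention_weights).
    """
    dim = len(queries[0])
    outputs, weights = [], []
    for query, bias_row in zip(queries, bias):
        logits = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim) + b
            for key, b in zip(keys, bias_row)
        ]
        row_weights = softmax(logits)
        weights.append(row_weights)
        outputs.append([
            sum(w * value[c] for w, value in zip(row_weights, values))
            for c in range(len(values[0]))
        ])
    return outputs, weights
```

Because the bias is fixed rather than learned, it adds no parameters, which is one plausible reason such a design could help when paired audio-4D data is scarce. All frames are processed in a single call, which mirrors the parallel generation described above.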

Paper Structure (17 sections, 10 equations, 5 figures, 2 tables)

Figures (5)

  • Figure 1: DiffSpeaker outperforms prior methods in both quality (FDD and LVE metrics; see the quantitative comparison table) and speed. Circle size indicates average inference latency for audio clips of 10-90 s duration. Despite its Diffusion-based generation, it maintains fast inference (details in Figure 5).
  • Figure 2: DiffSpeaker synthesizes facial motions $\mathbf{x}^{1:T}$ from speech audio $\mathbf{a}^{1:T}$ and a subject's speaking style $\mathbf{s}_k$, utilizing a Diffusion-based iterative denoising technique. Its core feature is a biased conditional attention mechanism that introduces static biases in self/cross-attention and employs encodings $\mathbf{e}_s$ and $\mathbf{e}_n$ to integrate speaking style and diffusion step information.
  • Figure 3: Qualitative comparison.
  • Figure 4: Standard deviation of facial motion.
  • Figure 5: Inference latency for 10-90 second audio clips.