StreamingTalker: Audio-driven 3D Facial Animation with Autoregressive Diffusion Model

Yifan Yang, Zhi Cen, Sida Peng, Xiangwei Chen, Yifu Deng, Xinyu Zhu, Fan Jia, Xiaowei Zhou, Hujun Bao

TL;DR

This work tackles real-time speech-driven 3D facial animation by introducing an autoregressive diffusion model that streams facial motion generation conditioned on a short history of past frames and audio. Motion is encoded into a discrete latent space via a VQ-VAE, and an AR condition predictor fuses past latents, audio, and speaker identity to guide a lightweight diffusion head, enabling streaming output with low latency. The approach achieves state-of-the-art results on BIWI and VOCASET, particularly for long sequences, while maintaining consistent, low-latency inference suitable for interactive demos. The paper also provides a real-time streaming demo and discusses limitations and future directions, including broader identities and richer emotional expressiveness.
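
The discrete latent space mentioned above can be illustrated with the nearest-neighbour codebook lookup that VQ-VAEs typically use. This is a minimal sketch only: the codebook values, the 2-D latent size, and the `quantize` helper are hypothetical and not taken from the paper.

```python
from typing import List, Sequence


def quantize(latent: Sequence[float], codebook: List[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to `latent` (squared L2)."""

    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(latent, codebook[i]))


# Toy codebook with three 2-D entries (illustrative values, not from the paper).
CODEBOOK = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)]

# A continuous encoder output is replaced by its nearest codebook entry.
index = quantize((0.9, 0.2), CODEBOOK)
quantized = CODEBOOK[index]
```

In a trained model the codebook is learned jointly with the encoder and decoder; here it is fixed only to keep the lookup step visible.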

Abstract

This paper focuses on speech-driven 3D facial animation, which aims to generate realistic facial motion synchronized with speech input. Recent methods employ audio-conditioned diffusion models for this task and produce expressive, natural animations. However, these methods process the whole audio sequence in a single pass, which poses two major challenges: they tend to perform poorly on audio sequences that exceed the training horizon, and they suffer from significant latency on long audio inputs. To address these limitations, we propose a novel autoregressive diffusion model that processes input audio in a streaming manner. This design accommodates varying audio lengths and achieves low latency independent of audio duration. Specifically, we select a limited number of past frames as historical motion context and combine them with the audio input to create a dynamic condition. This condition guides the diffusion process to iteratively generate facial motion frames, enabling real-time synthesis with high-quality results. Additionally, we implement a real-time interactive demo that highlights the effectiveness and efficiency of our approach. We will release the code at https://zju3dv.github.io/StreamingTalker/.
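
The streaming scheme described in the abstract can be sketched as a loop over incoming audio chunks with a fixed-size window of past frames. This is an illustrative sketch, not the authors' implementation: `stream_motion`, `toy_condition`, and `toy_denoise` are hypothetical names, and the two toy functions are simple arithmetic stand-ins for the condition predictor and the diffusion head.

```python
from collections import deque
from typing import Callable, Deque, Iterable, Iterator, List

Frame = List[float]


def stream_motion(
    audio_chunks: Iterable[List[float]],
    predict_condition: Callable[[List[Frame], List[float]], List[float]],
    denoise: Callable[[List[float]], Frame],
    history_len: int,
    init_frame: Frame,
) -> Iterator[Frame]:
    """Yield one motion frame per audio chunk, conditioning on a bounded history."""
    # Only the last `history_len` frames are kept, so the work per step
    # does not grow with the total audio duration.
    history: Deque[Frame] = deque([init_frame] * history_len, maxlen=history_len)
    for chunk in audio_chunks:
        cond = predict_condition(list(history), chunk)  # fuse past motion and audio
        frame = denoise(cond)  # stand-in for the conditioned diffusion head
        history.append(frame)  # the oldest frame is dropped automatically
        yield frame


def toy_condition(history: List[Frame], chunk: List[float]) -> List[float]:
    # Hypothetical fusion: half the mean of past frames plus the mean audio sample.
    past = sum(f[0] for f in history) / len(history)
    return [0.5 * past + sum(chunk) / len(chunk)]


def toy_denoise(cond: List[float]) -> Frame:
    # Hypothetical head: returns the condition unchanged instead of running diffusion.
    return list(cond)


frames = list(
    stream_motion(
        [[1.0, 1.0], [0.0, 0.0], [2.0]],
        toy_condition,
        toy_denoise,
        history_len=2,
        init_frame=[0.0],
    )
)
```

Because the generator yields each frame as soon as its chunk has been processed, a consumer can start rendering before the rest of the audio arrives.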

Paper Structure

This paper contains 28 sections, 13 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: (a) Overview of the pipeline. We employ an AR diffusion model to generate speech-driven 3D facial animations for inputs of arbitrary length. The model first encodes past motions $\mathbf{x}^{T-h:T-1}$, raw audio $\mathbf{a}^{T-h:T}$ and speaker identity $s_k$ to a dynamic condition. Then the diffusion head leverages this condition to guide the diffusion process. (b) The condition predictor. The condition predictor uses a transformer network to fuse the motion and audio modalities.
  • Figure 2: Qualitative comparison with state-of-the-art methods. The left side shows results on the VOCASET-Test dataset, while the right side shows results on the BIWI-Test-B dataset. Red words indicate the phonemes being pronounced. Compared to other methods, our approach produces more natural lip shapes, with rounder mouth formations when pronouncing vowels like 'a', 'o', and 'u', and better lip closure for bilabial consonants such as 'm' and 'p'.
  • Figure 3: Inference latency for 3-27 second audio clips. The figure compares the performance of various models, including full sequence diffusion models (DiffSpeaker, FaceDiffuser), deterministic models (VOCA, MeshTalk), and AR models (FaceFormer, CodeTalker). Our model outperforms all non-AR models in terms of inference speed, maintaining consistent latency regardless of audio length.
  • Figure 4: Overview of our real-time demo system.
  • Figure 5: Ablation study on key components of the model. We perform six ablation experiments to evaluate the impact of different components. 1) w/o Diffusion Head: Removing the diffusion head significantly degrades model performance. 2) Use All History Motions: Omitting this results in overly smooth outputs. 3) w/o VQ-VAE: Without VQ-VAE, the model tends to collapse. 4) Use VAE as Encoder: No significant difference is observed, indicating our model is robust to the choice of encoder. 5) w/o Cross-Attention: Removing cross-attention leads to erroneous results, as it is essential for incorporating speech information. 6) w/o Self-Attention: Without self-attention, the model fails to maintain consistency.
  • ...and 1 more figure