ARTalk: Speech-Driven 3D Head Animation via Autoregressive Model

Xuangeng Chu, Nabarun Goswami, Ziteng Cui, Hanqin Wang, Tatsuya Harada

TL;DR

ARTalk addresses the challenge of real-time, high-fidelity speech-driven 3D head animation with strong generalization to unseen speaking styles. It introduces a temporal multi-scale VQ autoencoder to learn a discrete motion codebook and a conditional autoregressive Transformer to map speech to multi-scale motion codes within sliding time windows, augmented by a style encoder. The two-stage training, cross-window causal reasoning, and multi-scale architecture yield superior lip synchronization, expressive timing, and stylistic consistency while maintaining real-time performance. The approach demonstrates robust results across multiple datasets, user studies, and ablations, highlighting its potential for real-time digital humans in education, entertainment, and interactive applications, alongside ethical considerations for synthetic media.

Abstract

Speech-driven 3D facial animation aims to generate realistic lip movements and facial expressions for 3D head models from arbitrary audio clips. Although existing diffusion-based methods are capable of producing natural motions, their slow generation speed limits their application potential. In this paper, we introduce a novel autoregressive model that achieves real-time generation of highly synchronized lip movements and realistic head poses and eye blinks by learning a mapping from speech to a multi-scale motion codebook. Furthermore, our model can adapt to unseen speaking styles, enabling the creation of 3D talking avatars with unique personal styles beyond the identities seen during training. Extensive evaluations and user studies demonstrate that our method outperforms existing approaches in lip synchronization accuracy and perceived quality.
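
To make the idea of a multi-scale motion codebook concrete, here is a minimal sketch of coarse-to-fine quantization over one time window. It rests on several assumptions that the text above does not state: each scale quantizes the residual left by the coarser scales (as in VAR-style quantizers), resampling is nearest-neighbour, and each motion frame is a single scalar rather than a learned latent vector. The names `resize`, `nearest_code`, `multiscale_quantize`, and `multiscale_decode` are hypothetical and do not come from the paper's code.

```python
def resize(seq, length):
    """Resample a sequence to `length` samples by nearest-neighbour lookup."""
    n = len(seq)
    return [seq[min(n - 1, (i * n) // length)] for i in range(length)]


def nearest_code(value, codebook):
    """Index of the codebook entry closest to `value`."""
    return min(range(len(codebook)), key=lambda c: abs(codebook[c] - value))


def multiscale_quantize(motion, codebook, scales):
    """Encode `motion` into one token list per scale, coarse to fine.

    Assumption: every scale quantizes what the coarser scales failed to
    explain (the residual), and all scales share a single codebook.
    """
    residual = list(motion)
    tokens_per_scale = []
    for k in scales:
        coarse = resize(residual, k)                      # view at k time steps
        tokens = [nearest_code(v, codebook) for v in coarse]
        tokens_per_scale.append(tokens)
        approx = resize([codebook[t] for t in tokens], len(motion))
        residual = [r - a for r, a in zip(residual, approx)]
    return tokens_per_scale


def multiscale_decode(tokens_per_scale, codebook, length):
    """Sum the upsampled code values of every scale to rebuild the motion."""
    recon = [0.0] * length
    for tokens in tokens_per_scale:
        up = resize([codebook[t] for t in tokens], length)
        recon = [r + u for r, u in zip(recon, up)]
    return recon


# Toy window of 8 frames, e.g. a jaw-opening value per frame (made-up data).
motion = [0.0, 0.1, 0.9, 1.0, 1.0, 0.9, 0.1, 0.0]
codebook = [-0.5, -0.1, 0.0, 0.1, 0.5, 1.0]
scales = [1, 2, 4, 8]

tokens = multiscale_quantize(motion, codebook, scales)
recon = multiscale_decode(tokens, codebook, len(motion))
max_error = max(abs(m - r) for m, r in zip(motion, recon))
print([len(t) for t in tokens], round(max_error, 3))
```

In this toy setup the coarse scales capture the overall shape of the window and the finer scales correct local detail, which is the intuition behind predicting motion codes scale by scale.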

Paper Structure

This paper contains 25 sections, 7 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: We present ARTalk, a framework for speech-driven 3D facial motion generation. Our method learns a mapping from speech to a multi-scale motion code, enabling the real-time generation of realistic and diverse animation sequences.
  • Figure 2: ARTalk consists of two separate parts. (a) The temporal multi-scale VQ autoencoder encodes motion sequences into multi-scale token maps $[M_{k_1}, M_{k_2}, ..., M_{K}]$ using a shared codebook and causal masking along the temporal dimension. (b) The ARTalk causal Transformer: training uses ground-truth tokens with a block-wise causal attention mask, while inference autoregressively predicts motion tokens conditioned on speech features, the tokens of the previous scale, and the motion of the previous time window.
  • Figure 3: Comparison of efficiency and performance across different methods.
  • Figure 4: Qualitative comparison with existing methods (all head poses fixed). The first four rows are from the TFHP dataset, and the last two rows are from the VOCASET dataset. Our method shows better alignment with the ground truth in expression style, mouth dynamics, and lip synchronization. Additional results are available in the supplementary materials and demo videos.
  • Figure 5: Qualitative results of head pose. When certain words are stressed or when accents occur, the model produces nodding motions similar to human behavior.
  • ...and 2 more figures
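
The block-wise causal attention mask mentioned in the Figure 2 caption can be illustrated with a short sketch. It assumes that tokens are laid out scale by scale, from coarse to fine, and that a token may attend to every token in its own scale and in all coarser scales. This layout is an assumption borrowed from next-scale prediction models; the paper's actual mask may differ, for example by also covering tokens from the previous time window. The function name `blockwise_causal_mask` is hypothetical.

```python
def blockwise_causal_mask(scale_lengths):
    """Build a boolean attention mask for tokens grouped into scales.

    mask[i][j] is True when query token i may attend to key token j.
    Assumption: attention is unrestricted inside a scale and causal
    across scales, so coarser scales never see finer ones.
    """
    # Label every token position with the index of the scale it belongs to.
    block_id = [s for s, n in enumerate(scale_lengths) for _ in range(n)]
    total = len(block_id)
    return [[block_id[j] <= block_id[i] for j in range(total)]
            for i in range(total)]


# Three scales holding 1, 2 and 4 tokens respectively (illustrative sizes).
mask = blockwise_causal_mask([1, 2, 4])
for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

During training, a mask of this kind lets all scales be supervised in a single forward pass with ground-truth tokens, while at inference the scales are generated one after another.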