UniTalker: Scaling up Audio-Driven 3D Facial Animation through A Unified Model

Xiangyu Fan, Jiaqi Li, Zhiqian Lin, Weiye Xiao, Lei Yang

TL;DR

UniTalker is a unified multi-head model that scales audio-driven 3D facial animation across datasets with heterogeneous annotations. It combines PCA-based output balancing, decoder warm-up, and a pivot identity embedding (PIE) so that a single model can learn from multiple languages and vocal types, using the A2F-Bench collection (~18.53 hours, 934 speakers). A single trained model reduces lip vertex error by 9.2% on BIWI and 13.7% on Vocaset and runs inference faster than prior methods. The pre-trained model also serves as a foundation model: fine-tuning on seen datasets lowers error by 6.3% on average, and fine-tuning on an unseen dataset with only half the data surpasses prior state-of-the-art models trained on the full dataset. Overall, the results indicate that larger, more diverse training data improves audio-to-face generation.

Abstract

Audio-driven 3D facial animation aims to map input audio to realistic facial motion. Despite significant progress, limitations arise from inconsistent 3D annotations, restricting previous models to training on specific annotations and thereby constraining the training scale. In this work, we present UniTalker, a unified model featuring a multi-head architecture designed to effectively leverage datasets with varied annotations. To enhance training stability and ensure consistency among multi-head outputs, we employ three training strategies, namely, PCA, model warm-up, and pivot identity embedding. To expand the training scale and diversity, we assemble A2F-Bench, comprising five publicly available datasets and three newly curated datasets. These datasets cover a wide range of audio domains, including multilingual speech and songs, thereby scaling the training data from commonly employed datasets, typically less than 1 hour, to 18.5 hours. With a single trained UniTalker model, we achieve substantial lip vertex error reductions of 9.2% on the BIWI dataset and 13.7% on Vocaset. Additionally, the pre-trained UniTalker exhibits promise as a foundation model for audio-driven facial animation tasks. Fine-tuning the pre-trained UniTalker on seen datasets further enhances performance on each dataset, with an average error reduction of 6.3% on A2F-Bench. Moreover, fine-tuning UniTalker on an unseen dataset with only half the data surpasses prior state-of-the-art models trained on the full dataset. The code and dataset are available at the project page https://github.com/X-niper/UniTalker.
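
To make the PCA strategy mentioned in the abstract more concrete, the following is a minimal sketch of how PCA could equalize output dimensions across datasets whose meshes have different vertex counts. It is not the authors' implementation: the dataset names, toy dimensions, number of components, and the helper functions `fit_pca`, `encode`, and `decode` are all assumptions made for illustration, and the data is synthetic.

```python
import numpy as np

def fit_pca(motion, n_components):
    """Fit PCA on a (frames, dims) array; return the mean and the basis rows."""
    mean = motion.mean(axis=0)
    # SVD of the centered data; rows of vt are the principal directions.
    _, _, vt = np.linalg.svd(motion - mean, full_matrices=False)
    return mean, vt[:n_components]

def encode(motion, mean, basis):
    """Project full-dimensional motion onto the PCA basis."""
    return (motion - mean) @ basis.T

def decode(coeffs, mean, basis):
    """Map PCA coefficients back to the full annotation space."""
    return coeffs @ basis + mean

rng = np.random.default_rng(0)

# Hypothetical annotation sizes (flattened vertex coordinates); real meshes
# are much larger, and these names are placeholders, not A2F-Bench datasets.
dims = {"dense_mesh": 3000, "sparse_mesh": 600}
n_components = 32  # assumed value; every decoder head would predict this many

results = {}
for name, d in dims.items():
    # Synthetic low-rank "motion" standing in for real vertex offsets.
    latent = rng.normal(size=(200, 16))
    mixing = rng.normal(size=(16, d))
    motion = latent @ mixing

    mean, basis = fit_pca(motion, n_components)
    coeffs = encode(motion, mean, basis)   # shape: (frames, n_components)
    recon = decode(coeffs, mean, basis)    # shape: (frames, d)
    results[name] = {"coeffs": coeffs, "recon": recon, "motion": motion}

for name, r in results.items():
    print(name, r["coeffs"].shape, r["recon"].shape)
```

In this sketch each head would regress the same number of coefficients regardless of mesh resolution, so no single dataset dominates the output layer by sheer dimensionality; the full-resolution vertices are recovered with `decode`.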

Paper Structure (28 sections, 4 equations, 7 figures, 11 tables)


Figures (7)

  • Figure 1: Left: UniTalker aims to learn from diverse datasets in a unified manner. It takes multilingual, multi-vocal-type audio as input and outputs various 3D facial annotation conventions simultaneously. Right: Fine-tuning UniTalker on each dataset consistently yields lower lip vertex error (LVE) than training the model on that dataset alone, with an average LVE drop of 6.3%. Refer to \ref{tab:LVE_single_all_and_finetune} for comprehensive numerical results.
  • Figure 1: Architecture Comparison. (a) Vanilla multi-head audio-to-face model. (b) UniTalker adopts PCA to balance the annotation dimension across datasets, uses decoder warm-up to stabilize training, and develops a pivot identity embedding to mitigate dataset bias. (c) Zoomed-in view of UniTalker-[D0-D7] decoder. UniTalker-[D0-D7] has 6 decoder heads.
  • Figure 2: UniTalker architecture. UniTalker adopts vertex PCA to balance the annotation dimension across datasets, uses decoder warm-up to stabilize training, and develops a pivot identity embedding to mitigate dataset bias.
  • Figure 3: Effect of PIE. Without PIE, the model generates unnatural face motion when the input identity and the output annotation are mismatched.
  • Figure 4: Comparison between fine-tuning Wav2vec2-xlsr-53 and UniTalker-L-[D1-D7] on D0. The x-axis uses a log scale.
  • ...and 2 more figures