Beyond Fixed Topologies: Unregistered Training and Comprehensive Evaluation Metrics for 3D Talking Heads

Federico Nocentini, Thomas Besnier, Claudio Ferrari, Sylvain Arguillere, Mohamed Daoudi, Stefano Berretti

TL;DR

ScanTalk tackles the challenge of animating 3D talking heads across arbitrary mesh topologies by introducing a topology-agnostic framework that employs DiffusionNet-based geometry and audio encoders to produce time-dependent per-vertex deformations. It supports both registered (topology-consistent sequences) and unregistered (varying topology per frame) training, using a dynamic Chamfer loss in the latter and standard MSE/velocity/cosine losses in the former. The paper argues that conventional metrics (LVE, MVE, FDD) are insufficient for assessing lip-sync and motion fidelity, and proposes a new suite of metrics (DTW, DFD, $\delta_{M}$, $\delta_{Cd}$, etc.) and complementary distance measures (HD, $\mathcal{L}^K$) for a more thorough evaluation. Empirical results on VOCAset, BIWI, and Multiface show ScanTalk achieving competitive performance and robustness, including on unregistered meshes and real scan data, with user studies corroborating the improved lip-sync realism. The work advances practical 3D facial animation by removing topology constraints, enabling broader deployment in media, VR/AR, and research contexts, and provides publicly available code and models.
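To illustrate why a Chamfer-style loss suits the unregistered setting, here is a minimal sketch of a symmetric Chamfer distance between two point sets with different vertex counts. It is an assumption-laden illustration, not the paper's implementation: the function name `chamfer_distance`, the use of squared Euclidean distances, and the averaging over points are all choices made here, and the "dynamic" variant described in the paper is not reproduced. Plain Python is used for readability; a training pipeline would use a tensor library.

```python
def chamfer_distance(pred, target):
    """Symmetric Chamfer distance between two point sets.

    pred, target: lists of (x, y, z) tuples; the two lists may differ in
    length, so no vertex-to-vertex correspondence is required.
    Assumption: squared Euclidean distance, averaged per set.
    """
    def one_sided(src, dst):
        # For each source point, take the squared distance to its
        # nearest neighbour in dst, then average over the source set.
        total = 0.0
        for p in src:
            total += min(
                sum((a - b) ** 2 for a, b in zip(p, q)) for q in dst
            )
        return total / len(src)

    return one_sided(pred, target) + one_sided(target, pred)


# Hypothetical toy data: a 2-vertex prediction against a 3-vertex scan.
pred = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
scan = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
print(chamfer_distance(pred, scan))
```

Because each term only needs nearest neighbours, the loss stays defined when the ground-truth mesh changes topology from frame to frame, which is the situation the unregistered training setting addresses.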

Abstract

Generating speech-driven 3D talking heads presents numerous challenges; among those is dealing with varying mesh topologies where no point-wise correspondence exists across the meshes the model can animate. While previous works assume fixed mesh structures, in this work we present the first framework capable of animating 3D faces in arbitrary topologies, including real scanned data. Our approach leverages heat diffusion to predict features that are robust to the mesh topology. We explore two training settings: a registered one, in which the meshes in a training sequence share a fixed topology but any mesh can be animated at test time, and a fully unregistered one, which allows effective training with varying mesh structures. Additionally, we highlight the limitations of current evaluation metrics and propose new metrics for better lip-syncing evaluation. An extensive evaluation shows our approach performs favorably compared to fixed topology techniques, setting a new benchmark by offering a versatile and high-fidelity solution for 3D talking heads where the topology constraint is dropped. The code and the pre-trained model are available.
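Dynamic time warping (DTW) is one of the sequence-level metrics named in the TL;DR. The sketch below shows the classic DTW recurrence on two one-dimensional curves, such as a per-frame vertical lip displacement. It is only an illustration under stated assumptions: the function name `dtw_distance`, the absolute-difference local cost, and the scalar per-frame signal are choices made here; which features the paper actually feeds into DTW is not specified in this summary.

```python
def dtw_distance(a, b):
    """Classic dynamic time warping cost between two 1D sequences.

    Assumption: local cost is the absolute difference between samples,
    and no warping-window constraint is applied.
    """
    n, m = len(a), len(b)
    inf = float("inf")
    # cost[i][j] = minimal accumulated cost of aligning a[:i] with b[:j].
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i][j] = d + min(
                cost[i - 1][j],      # a advances, b repeats a frame
                cost[i][j - 1],      # b advances, a repeats a frame
                cost[i - 1][j - 1],  # both advance
            )
    return cost[n][m]


# Hypothetical lip-opening curves: same shape, one delayed by a frame.
reference = [0.0, 1.0, 2.0, 1.0, 0.0]
delayed = [0.0, 0.0, 1.0, 2.0, 1.0, 0.0]
print(dtw_distance(reference, delayed))
```

A frame-by-frame error would penalize the delayed curve even though the mouth motion has the same shape, whereas DTW absorbs the shift through the alignment. This is the kind of behavior that motivates complementing per-vertex error metrics with sequence-level ones.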

Paper Structure

This paper contains 26 sections, 22 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: ScanTalk animates any 3D face mesh from speech, handling both registered and unregistered meshes across diverse datasets with a single model.
  • Figure 2: Overview of the proposed deep architecture. From a static mesh and an audio file, it computes time-dependent per-vertex features as a concatenation of geometric features $f_i^n$ and audio features $\hat{a_i}$. This learned signal over the mesh is used to learn a time-dependent displacement field, which produces the motion. At training time, the generated sequence is compared to the ground truth with different loss functions for the registered and unregistered cases.
  • Figure 3: Comparison of ScanTalk variants and state-of-the-art methods on a VOCAset test sequence, focusing on vertical lip movements ($y$-coordinate). Each row shows visual outputs and corresponding lip-sync plots. The $y$-coordinate captures the main dynamics of lip motion. ScanTalk benefits from additional loss terms, yielding more accurate and expressive lip movement. Identical colors denote the same timestep.
  • Figure 4: Visualization results comparing error maps and multidimensional scaling.