Beyond Fixed Topologies: Unregistered Training and Comprehensive Evaluation Metrics for 3D Talking Heads
Federico Nocentini, Thomas Besnier, Claudio Ferrari, Sylvain Arguillere, Mohamed Daoudi, Stefano Berretti
TL;DR
ScanTalk tackles the challenge of animating 3D talking heads across arbitrary mesh topologies by introducing a topology-agnostic framework that employs DiffusionNet-based geometry and audio encoders to produce time-dependent per-vertex deformations. It supports both registered training (topology-consistent sequences) and unregistered training (topology varying per frame), using a dynamic Chamfer loss in the latter and standard MSE/velocity/cosine losses in the former. The paper argues that conventional metrics (LVE, MVE, FDD) are insufficient for assessing lip-sync and motion fidelity, and proposes a new suite of metrics (DTW, DFD, $\delta_{M}$, $\delta_{Cd}$, etc.) together with complementary distance measures (HD, $\mathcal{L}^K$) for a more thorough evaluation. Empirical results on VOCAset, BIWI, and Multiface show ScanTalk achieving competitive performance and robustness, including on unregistered meshes and real scan data, with user studies corroborating the improved lip-sync realism. The work advances practical 3D facial animation by removing topology constraints, enabling broader deployment in media, VR/AR, and research contexts, and provides publicly available code and models.
Abstract
Generating speech-driven 3D talking heads presents numerous challenges; among them is dealing with varying mesh topologies, where no point-wise correspondence exists across the meshes the model can animate. While previous works assume fixed mesh structures, in this work we present the first framework capable of animating 3D faces in arbitrary topologies, including real scanned data. Our approach leverages heat diffusion to predict features that are robust to the mesh topology. We explore two training settings: a registered one, in which the meshes in a training sequence share a fixed topology but any mesh can be animated at test time, and a fully unregistered one, which allows effective training with varying mesh structures. Additionally, we highlight the limitations of current evaluation metrics and propose new metrics for better lip-syncing evaluation. An extensive evaluation shows that our approach performs favorably compared to fixed-topology techniques, setting a new benchmark by offering a versatile and high-fidelity solution for 3D talking heads in which the topology constraint is dropped. The code and the pre-trained model are available.
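
To illustrate why the unregistered setting needs a correspondence-free loss, the following is a minimal sketch of a generic symmetric Chamfer distance between two point sets of different sizes. It is not the paper's implementation: the function names, the use of squared distances, and the equal weighting of the two directions are assumptions made for this example, and the "dynamic" variant used in training is not reproduced here.

```python
def sq_dist(p, q):
    """Squared Euclidean distance between two 3D points given as tuples."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def chamfer_distance(points_a, points_b):
    """Symmetric Chamfer distance between two point sets.

    Each set is a list of (x, y, z) tuples. The sets may differ in size,
    and no vertex-to-vertex correspondence is required: every point is
    matched to its nearest neighbour in the other set.
    (Hypothetical helper; brute force, O(len(a) * len(b)).)
    """
    def one_sided(src, dst):
        # Mean squared distance from each source point to its nearest target point.
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_sided(points_a, points_b) + one_sided(points_b, points_a)


# Toy example: a predicted mesh with 3 vertices and a "scan" with 4 vertices.
predicted = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
scan = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (1.0, 1.0, 0.0)]

print(chamfer_distance(predicted, scan))  # 0.25
```

In this toy case every predicted vertex coincides with a scan vertex, so the only contribution comes from the extra scan vertex (1, 1, 0), whose nearest predicted vertex lies at squared distance 1, averaged over the four scan vertices. A per-vertex MSE, by contrast, would not even be defined here because the two vertex lists have different lengths.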
