ScanTalk: 3D Talking Heads from Unregistered Scans

Federico Nocentini, Thomas Besnier, Claudio Ferrari, Sylvain Arguillere, Stefano Berretti, Mohamed Daoudi

TL;DR

This work presents ScanTalk, a novel framework that animates 3D faces of arbitrary topology, including scanned data. It relies on the DiffusionNet architecture to overcome the fixed-topology constraint, offering promising avenues for more flexible and realistic 3D animations.

Abstract

Speech-driven 3D talking head generation has emerged as a significant area of research and presents numerous challenges. Existing methods can only animate faces with a fixed topology, in which point-wise correspondence is established and the number and order of points remain consistent across all identities the model can animate. In this work, we present ScanTalk, a novel framework capable of animating 3D faces of arbitrary topology, including scanned data. Our approach relies on the DiffusionNet architecture to overcome the fixed-topology constraint, offering promising avenues for more flexible and realistic 3D animations. By leveraging DiffusionNet, ScanTalk not only adapts to diverse facial structures but also maintains fidelity when dealing with scanned data, thereby enhancing the authenticity and versatility of the generated 3D talking heads. Through comprehensive comparisons with state-of-the-art methods, we validate the efficacy of our approach, demonstrating its capacity to generate realistic talking heads comparable to existing techniques. While our primary objective is to develop a generic method free from topological constraints, all state-of-the-art methodologies are bound by such limitations. Code for reproducing our results and the pre-trained model are available at https://github.com/miccunifi/ScanTalk.

Paper Structure

This paper contains 27 sections, 9 equations, 12 figures, 4 tables.

Figures (12)

  • Figure 1: We present ScanTalk, a deep learning architecture that animates any 3D face mesh driven by speech. ScanTalk is robust enough to train on multiple unrelated datasets with a single model, while still allowing inference on unregistered face meshes.
  • Figure 1: ScanTalk performance, in both single-dataset (s-d) and multi-dataset (m-d) scenarios, in comparison with state-of-the-art methods. The heatmaps show the differences between the first frame and subsequent frames within sequences from the VOCAset, BIWI$_6$, and Multiface datasets. Notably, in VOCAset movement is concentrated on the lips, whereas in the BIWI$_6$ and Multiface datasets substantial head and upper-face movements are observed. The color gradient on the face meshes corresponds to the average per-vertex $L_2$ norm of the differences, where blue hues indicate lower values and red hues indicate higher values.
  • Figure 2: Architecture of ScanTalk, a novel Encoder-Decoder framework designed to animate any 3D face from a spoken sentence in an audio file. The Encoder takes the 3D neutral face $m_i^n$, the per-vertex surface features $P_i^{n}$ (required by DiffusionNet and precomputed by the operators $OP$), and the audio file $A_i$, and yields a fusion of per-vertex and audio features. These combined descriptors, alongside $P_i^n$, are then passed to the Decoder, which mirrors a reversed DiffusionNet encoder structure. The Decoder predicts the deformation of the 3D neutral face, which is then combined with the original 3D neutral face $m_i^n$ to generate the animated sequence.
  • Figure 3: ScanTalk GPU memory usage with respect to the mesh resolution.
  • Figure 4: Relative norm of the per-vertex descriptors $f_i^n$ in \ref{eq:per-vertex-descriptors} extracted by $DN_e$, displayed as a heatmap on a mesh from VOCAset (left) and a mesh from Multiface (right). For each mesh, we show the norm on the original topology, on a remeshed version, and on a further degraded mesh obtained by removing the back of the head and creating random holes. Here, pinker hues indicate lower values, and greener hues indicate higher values.
  • ...and 7 more figures
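
Two operations mentioned in the captions above can be illustrated with a minimal NumPy sketch: combining a predicted deformation with the neutral face, and computing the average per-vertex $L_2$ norm of the differences between the first frame and subsequent frames. This is not the authors' code. The function names `animate` and `motion_heatmap` are hypothetical, the array shapes are assumptions, and "combined" is assumed here to mean simple per-vertex addition, which the caption does not state explicitly.

```python
import numpy as np


def animate(neutral, displacements):
    """Combine a neutral face with per-frame deformations (assumed additive).

    neutral:       (N, 3) vertex positions of the neutral face
    displacements: (T, N, 3) per-frame, per-vertex offsets
    returns:       (T, N, 3) animated vertex positions
    """
    # Broadcasting adds the same neutral face to every frame.
    return neutral[None, :, :] + displacements


def motion_heatmap(sequence):
    """Average per-vertex L2 norm of (frame t - frame 0) over t >= 1.

    sequence: (T, N, 3) vertex positions over T frames
    returns:  (N,) one scalar per vertex, suitable for color mapping
    """
    diffs = sequence[1:] - sequence[0]            # (T-1, N, 3)
    per_frame = np.linalg.norm(diffs, axis=-1)    # (T-1, N)
    return per_frame.mean(axis=0)                 # (N,)


# Toy example: 5 vertices, 4 frames, only vertex 0 moves after frame 0.
rng = np.random.default_rng(0)
neutral = rng.normal(size=(5, 3))
disp = np.zeros((4, 5, 3))
disp[1:, 0, 1] = 0.5  # vertex 0 shifts by 0.5 along y in frames 1..3

seq = animate(neutral, disp)
heat = motion_heatmap(seq)
```

Neither function depends on a fixed vertex count or ordering, so the same code applies to meshes of any resolution; only the number of vertices N changes.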