FreeTalk: Emotional Topology-Free 3D Talking Heads

Federico Nocentini, Thomas Besnier, Claudio Ferrari, Stefano Berretti, Mohamed Daoudi

Abstract

Speech-driven 3D facial animation has advanced rapidly, yet most approaches remain tied to registered template meshes, preventing effective deployment on raw 3D scans with arbitrary topology. At the same time, modeling controllable emotional dynamics beyond lip articulation remains challenging, and is often tied to template-based parameterizations. We address these challenges by proposing FreeTalk, a two-stage framework for emotion-conditioned 3D talking-head animation that generalizes to unregistered face meshes with arbitrary vertex count and connectivity. First, Audio-To-Sparse (ATS) predicts a temporally coherent sequence of 3D landmark displacements from speech audio, conditioned on an emotion category and intensity. This sparse representation captures both articulatory and affective motion while remaining independent of mesh topology. Second, Sparse-To-Mesh (STM) transfers the predicted landmark motion to a target mesh by combining intrinsic surface features with landmark-to-vertex conditioning, producing dense per-vertex deformations without template fitting or correspondence supervision at test time. Extensive experiments show that FreeTalk matches specialized baselines when trained in-domain, while providing substantially improved robustness to unseen identities and mesh topologies. Code and pre-trained models will be made publicly available.

Paper Structure (28 sections, 16 equations, 7 figures, 5 tables)

Figures (7)

  • Figure 1: FreeTalk takes an unregistered static face mesh (in blue on the left) and an emotional speech audio clip, and predicts a sequence of deformation fields (in cyan) matching speech-related motions and expression-related deformations.
  • Figure 2: Overview of the proposed approach. ATS takes the speech audio signal $A$ and generates a sequence of landmark displacements $(\Delta L_t)_{t=1}^{T}$. Then, STM takes a static neutral mesh $M$ and maps the sparse landmark motion sequence $(\Delta L_t)_{t=1}^{T}$ to dense vertex motions to predict the final animated mesh sequence $(\hat{M}_t)_{t=1}^{T}$.
  • Figure 3: Overview of the Sparse-To-Mesh (STM) module. STM predicts vertex-wise features on a static mesh $M$, and landmark features from displacement vectors $L$. The mesh features are enhanced by a learnable mapping that injects a global embedding of the landmark displacements into each vertex via cross-attention.
  • Figure 4: Qualitative comparison on the MEAD-EMOTE test set. Each panel shows a sequence of rendered frames for a single identity. Left: Disgust; right: Happy. Numbered settings follow the definition in Table \ref{tab:quantitative_comparison}.
  • Figure 5: Results on unseen topologies.
  • ...and 2 more figures
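
The landmark-to-vertex conditioning mentioned in the abstract and in the Figure 3 caption can be illustrated with a single cross-attention step in which every vertex feature queries the set of landmark features. The sketch below is not the authors' implementation: it omits learned projections, multiple heads, the intrinsic surface features, and the global landmark embedding, and the function name `cross_attend`, the feature width of 8, and the count of 68 landmarks are all hypothetical choices made for illustration. Its only purpose is to show why this kind of conditioning does not depend on vertex count or connectivity: the output has one row per input vertex, whatever the mesh.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(vertex_feats, landmark_feats):
    """Scaled dot-product cross-attention (illustrative, no learned weights).

    vertex_feats:   list of N feature vectors (queries), N is arbitrary
    landmark_feats: list of K feature vectors (used as both keys and values)
    Returns N vectors, each a convex combination of the landmark features.
    """
    dim = len(landmark_feats[0])
    scale = 1.0 / math.sqrt(dim)
    attended = []
    for query in vertex_feats:
        scores = [
            scale * sum(q * k for q, k in zip(query, key))
            for key in landmark_feats
        ]
        weights = softmax(scores)
        attended.append([
            sum(w * value[c] for w, value in zip(weights, landmark_feats))
            for c in range(dim)
        ])
    return attended


def random_feats(count, dim, rng):
    """Stand-in features; a real model would compute these from data."""
    return [[rng.uniform(-1.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
landmarks = random_feats(68, 8, rng)   # assumed landmark count and width
mesh_a = random_feats(500, 8, rng)     # two "meshes" with different
mesh_b = random_feats(1234, 8, rng)    # vertex counts, no connectivity used

out_a = cross_attend(mesh_a, landmarks)
out_b = cross_attend(mesh_b, landmarks)
print(len(out_a), len(out_b), len(out_a[0]))
```

The same landmark features condition both vertex sets without any per-mesh correspondence table, which is the property the topology-free setting relies on. In a trained model the queries, keys, and values would pass through learned projections before this step.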