Learning Dynamic Tetrahedra for High-Quality Talking Head Synthesis

Zicheng Zhang, Ruobing Zheng, Ziwen Liu, Congying Han, Tianqi Li, Meng Wang, Tiande Guo, Jingdong Chen, Bonan Li, Ming Yang

TL;DR

DynTet (Dynamic Tetrahedra) is a hybrid representation that couples explicit mesh topology with neural deformation, enabling high-fidelity, lip-synced talking-head synthesis from short videos. It decouples topology from geometry and uses elastic scores to scale per-vertex deformation, which keeps deformations stable and controllable, while a canonical projection and 3DMM priors guide texture and shape learning. Meshes are built on a fixed tetrahedral grid: the surface is extracted from a learned SDF with Marching Tetrahedra, then rendered by a differentiable rasterizer with physically-based materials and environment lighting. Quantitative and qualitative results show improvements over state-of-the-art NeRF-based and 2D/3DMM methods in fidelity, lip synchronization, and runtime. DynTet also outputs dynamic meshes suitable for AR/VR and content creation, offering a practical, extensible path toward real-time, high-quality head avatars with explicit geometry control.

Abstract

Recent work on implicit representations, such as Neural Radiance Fields (NeRF), has advanced the generation of realistic and animatable head avatars from video sequences. These implicit methods still suffer from visual artifacts and jitter, since the lack of explicit geometric constraints poses a fundamental challenge in accurately modeling complex facial deformations. In this paper, we introduce Dynamic Tetrahedra (DynTet), a novel hybrid representation that encodes explicit dynamic meshes with neural networks to ensure geometric consistency across various motions and viewpoints. DynTet is parameterized by coordinate-based networks that learn signed distance, deformation, and material texture, anchoring the training data to a predefined tetrahedral grid. Leveraging Marching Tetrahedra, DynTet efficiently decodes textured meshes with a consistent topology, enabling fast rendering through a differentiable rasterizer and supervision via a pixel loss. To enhance training efficiency, we incorporate classical 3D Morphable Models to facilitate geometry learning and define a canonical space that simplifies texture learning. These advantages are readily achievable owing to the effective geometric representation employed in DynTet. Compared with prior works, DynTet demonstrates significant improvements in fidelity, lip synchronization, and real-time performance according to various metrics. Beyond producing stable and visually appealing synthesized videos, our method also outputs dynamic meshes, which are promising for many emerging applications.
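
To make the mesh-decoding step above more concrete, here is a minimal sketch of the edge interpolation that Marching Tetrahedra relies on: a surface vertex is placed on every grid edge whose two endpoints have SDF values of opposite sign. This is not the authors' implementation. The function names `interpolate_zero_crossing` and `surface_points` are hypothetical, the SDF values are plain numbers rather than network outputs, and the triangle-assembly lookup table is omitted.

```python
def interpolate_zero_crossing(v_a, v_b, s_a, s_b):
    """Return the point on edge (v_a, v_b) where a linearly varying SDF is zero.

    v_a, v_b: 3D endpoint coordinates; s_a, s_b: their signed distances.
    """
    if s_a * s_b >= 0:
        raise ValueError("edge does not cross the surface: SDF signs must differ")
    t = s_a / (s_a - s_b)  # fraction of the way from v_a to v_b
    return tuple(a + t * (b - a) for a, b in zip(v_a, v_b))


def surface_points(vertices, sdf, edges):
    """Collect one interpolated surface point per sign-changing edge.

    Hypothetical helper: a full Marching Tetrahedra pass would also connect
    these points into triangles per tetrahedron, which is skipped here.
    """
    points = []
    for i, j in edges:
        if sdf[i] * sdf[j] < 0:
            points.append(
                interpolate_zero_crossing(vertices[i], vertices[j], sdf[i], sdf[j])
            )
    return points


if __name__ == "__main__":
    # One tetrahedron with a single vertex inside the surface (negative SDF).
    verts = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)]
    sdf = [-1.0, 1.0, 1.0, 1.0]
    tet_edges = [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]
    print(surface_points(verts, sdf, tet_edges))
```

Because the interpolation is a smooth function of the SDF values and vertex positions, gradients from a pixel loss can flow back to whatever produces them, which is what makes this kind of extraction usable inside a differentiable pipeline.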

Paper Structure

This paper contains 41 sections, 16 equations, 8 figures, and 3 tables.

Figures (8)

  • Figure 1: Illustration of the proposed DynTet for modeling a talking head. Left panel: The top branch predicts topology-related information, including SDF values $s$ and elastic scores. The bottom branch, conditioned by talking signals, predicts the offset vectors scaled by the elastic scores to yield deformation vectors $\Delta \mathbf{v}$ for driving the tetrahedral grid. Then, the Marching Tetrahedra (MT) algorithm decodes the meshes. Right panel: The pixel coordinates are projected into a canonical space. Then the physically-based materials and lighting are sampled for rendering images.
  • Figure 2: Qualitative comparison of DynTet with prior methods [zhou2021pcavs, prajwal2020wav2lip, tang2022rad, Li2023EfficientRN]. Representative defects are marked with red arrows, highlighting where the generated eyes, mouths, or wrinkles differ from the real ones. The right panel presents details of the mouth and eye areas. The results show the superior realism and motion accuracy achieved by DynTet compared to existing methods.
  • Figure 3: The triangular meshes from 3DMM [Blanz1999AMM], ER-NeRF [Li2023EfficientRN], and DynTet. The surfaces extracted from ER-NeRF using Marching Cubes [Lorensen1987MarchingCA] exhibit noise and undesirable topology. Note that the backs of the meshes may have some defects due to limited training data, but these do not affect the rendering results.
  • Figure 4: Qualitative results of the cross-driven setting. The top and bottom panels show the frame- and audio-driven results, respectively. We attach the estimated 3DMM shapes for reference.
  • Figure 5: Validation of the key components in DynTet. (a) Result of DynTet. (b) Replacing frequency encoding with hash encoding [muller2022instant]. (c) Removing 3DMM supervision. (d) Removing canonical projection. (e) Ground truth. (f) Visualization of elastic scores.
  • ...and 3 more figures
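
The deformation step described in the Figure 1 caption (offset vectors scaled by elastic scores to yield $\Delta \mathbf{v}$, which then drives the tetrahedral grid) can be sketched as follows. This is an illustrative sketch under stated assumptions, not the paper's code: the name `deform_vertices` is hypothetical, the elastic score is assumed to be one scalar per grid vertex, and the offsets and scores are given as plain lists rather than predicted by networks.

```python
def deform_vertices(vertices, offsets, elastic_scores):
    """Move each grid vertex by its offset scaled by its elastic score.

    vertices:       list of (x, y, z) rest positions of the tetrahedral grid
    offsets:        list of (dx, dy, dz) offset vectors (signal-conditioned in the paper)
    elastic_scores: list of per-vertex scalars (assumed layout); 0 freezes a vertex
    """
    deformed = []
    for v, o, e in zip(vertices, offsets, elastic_scores):
        delta = tuple(e * c for c in o)  # deformation vector for this vertex
        deformed.append(tuple(p + d for p, d in zip(v, delta)))
    return deformed


if __name__ == "__main__":
    grid = [(0.0, 0.0, 0.0), (1.0, 1.0, 1.0)]
    offs = [(1.0, 2.0, 3.0), (1.0, 2.0, 3.0)]
    scores = [0.0, 0.5]  # first vertex is rigid, second follows half the offset
    print(deform_vertices(grid, offs, scores))
```

Only vertex positions change in this sketch; the SDF values and the grid connectivity are untouched, which is consistent with the caption's split between a topology branch and a deformation branch.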