DPHMs: Diffusion Parametric Head Models for Depth-based Tracking

Jiapeng Tang, Angela Dai, Yinyu Nie, Lev Markhasin, Justus Thies, Matthias Niessner

TL;DR

This paper tackles robust reconstruction and tracking of 3D head geometries from monocular depth sequences, a task made difficult by partial observations and sensor noise. It introduces Diffusion Parametric Head Models (DPHMs), which couple Neural Parametric Head Models (NPHMs) with diffusion-based priors to regularize identity and expression latents onto plausible manifolds during test-time optimization. The work presents novel components: a backward deformation-based expression space, a two-part latent diffusion model (identity and expression), and diffusion-based regularizers that improve temporal coherence and geometric plausibility. A new DPHM-Kinect dataset with challenging expressive motion is introduced, and extensive experiments show superior head geometry accuracy and more robust expression tracking compared to state-of-the-art methods, including ablation studies validating each design choice. The approach has practical impact for accessible, high-fidelity head avatars from consumer depth sensors, with potential benefits for AR/VR, telepresence, and digital twins.

Abstract

We introduce Diffusion Parametric Head Models (DPHMs), a generative model that enables robust volumetric head reconstruction and tracking from monocular depth sequences. While recent volumetric head models, such as NPHMs, can now excel in representing high-fidelity head geometries, tracking and reconstructing heads from real-world single-view depth sequences remains very challenging, as the fitting to partial and noisy observations is underconstrained. To tackle these challenges, we propose a latent diffusion-based prior to regularize volumetric head reconstruction and tracking. This prior-based regularizer effectively constrains the identity and expression codes to lie on the underlying latent manifold which represents plausible head shapes. To evaluate the effectiveness of the diffusion-based prior, we collect a dataset of monocular Kinect sequences consisting of various complex facial expression motions and rapid transitions. We compare our method to state-of-the-art tracking methods and demonstrate improved head identity reconstruction as well as robust expression tracking.
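One way to read the fitting problem described above is as a single test-time objective over the identity and expression codes. The sketch below is an illustrative summary, not the paper's exact formulation: the data terms $L_{sdf}$ and $L_{norm}$ and the temporal term $L_{temp}$ are named in the Figure 1 caption, while the regularizer names $L^{id}_{reg}$, $L^{ex}_{reg}$ and all weights $\lambda$ are hypothetical placeholders introduced here.

```latex
\min_{\mathcal{Z}} \;
  \underbrace{L_{sdf} + \lambda_{norm} L_{norm}}_{\text{fit to depth observations}}
  \;+\;
  \underbrace{\lambda_{id} L^{id}_{reg}(\mathbf{z}^{id})
    + \lambda_{ex} \sum_{i=1}^{N} L^{ex}_{reg}(\mathbf{z}^{ex}_i)}_{\text{diffusion prior on the latents}}
  \;+\;
  \underbrace{\lambda_{temp} L_{temp}}_{\text{temporal coherence}},
\qquad
\mathcal{Z} = \{ \mathbf{z}^{id}, \mathbf{z}^{ex}_1, \ldots, \mathbf{z}^{ex}_N \}
```

Under this reading, the data terms alone are underconstrained for partial, noisy scans, and the prior terms are what keep the codes near plausible head shapes.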

Paper Structure (51 sections, 21 equations, 18 figures, 7 tables)

Figures (18)

  • Figure 1: DPHMs for depth-based tracking. Given a sequence of depth maps $\mathcal{I}$ with $N$ frames, our objective is to reconstruct a full-head avatar $\mathcal{O}$, including its expression transitions. To achieve this, we optimize the parametric latents $\mathcal{Z} = \{ \mathbf{z}^{id}, \mathbf{z}^{ex}_1, \ldots, \mathbf{z}^{ex}_N \}$ of NPHM, which the identity and expression decoders decode into continuous signed distance fields $\mathcal{O}$. To align with the observations, we compute the data terms $L_{sdf}$ and $L_{norm}$ between $\mathcal{I}$ and $\mathcal{O}$. However, high noise levels still make the latent optimization extremely challenging. At the core of our method is an effective latent regularization using diffusion priors: we add Gaussian noise to $\mathcal{Z}$ and pass the noised latents into the identity and expression diffusion models, which predict the perturbation noise $\epsilon$ used to update $\mathcal{Z}$. The diffusion regularizer guides $\mathbf{z}^{id}$ and $\mathbf{z}^{ex}_i$ towards the manifolds of their respective distributions via $\epsilon^{id}$ and $\epsilon^{ex}$, ensuring plausible head geometry reconstruction and robust tracking. To enhance temporal coherence, $L_{temp}$ penalizes inconsistency between the $\mathbf{z}^{ex}_i$ of nearby frames.
  • Figure 2: An example of captured DPHM-Kinect sequences with complex facial expressions and fast transitions.
  • Figure 3: Head Tracking on the DPHM-Kinect dataset. Note that RGB images are shown only for reference; no method other than ImAvatar uses them. Compared to state-of-the-art methods, our approach achieves more accurate identity reconstruction with detailed hair geometry while tracking more plausible expressions, even during extreme mouth movements.
  • Figure 4: Head Reconstruction and Tracking on the single-view depth sequences of NeRSemble (Kirschstein et al., 2023). Note that RGB images are shown only for reference; no method other than ImAvatar uses them. Compared to state-of-the-art methods, our approach reconstructs realistic head avatars with hair and accurately captures intricate facial expressions such as eyelid movements.
  • Figure 5: Ablation Studies. (a) RGB reference and input scans; (b) Ours with forward deformations; (c) Ours with VAE priors; (d) Ours without expression diffusion; (e) Ours without identity diffusion; (f) Ours. Note that RGB images are shown only for reference; no method other than ImAvatar uses them. We visualize the scan2mesh distance error map at the bottom. Our final model captures complicated expressions with lower identity reconstruction errors.
  • ...and 13 more figures
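
The regularization loop described in the Figure 1 caption (noise the latent, predict the noise, update the latent) and the temporal term can be illustrated with a small toy sketch. It rests on several assumptions that are not taken from the paper: the update is read as a score-distillation-style gradient `eps_pred - eps`, the trained diffusion models are replaced by an analytic noise predictor for a Gaussian prior, and every function name and hyperparameter (`gaussian_denoiser`, `regularize`, `alpha_bar`, `lr`) is hypothetical.

```python
import math
import random


def gaussian_denoiser(z_t, alpha_bar, mu, sigma):
    """Exact noise predictor E[eps | z_t] when clean latents follow N(mu, sigma^2 I).

    Toy stand-in (assumption) for a trained identity or expression diffusion model.
    """
    var_t = alpha_bar * sigma**2 + (1.0 - alpha_bar)
    scale = math.sqrt(1.0 - alpha_bar) / var_t
    return [scale * (zt - math.sqrt(alpha_bar) * m) for zt, m in zip(z_t, mu)]


def diffusion_reg_grad(z, alpha_bar, mu, sigma, rng):
    """Noise the latent, predict the noise, and return (eps_pred - eps) as a gradient."""
    eps = [rng.gauss(0.0, 1.0) for _ in z]
    z_t = [math.sqrt(alpha_bar) * zi + math.sqrt(1.0 - alpha_bar) * e
           for zi, e in zip(z, eps)]
    eps_pred = gaussian_denoiser(z_t, alpha_bar, mu, sigma)
    return [p - e for p, e in zip(eps_pred, eps)]


def temporal_loss(z_ex):
    """Mean squared difference between expression codes of consecutive frames.

    z_ex is a list of N per-frame codes, each a list of D floats.
    """
    pairs = list(zip(z_ex[:-1], z_ex[1:]))
    if not pairs:
        return 0.0
    total = sum(sum((b - a) ** 2 for a, b in zip(prev, nxt)) for prev, nxt in pairs)
    return total / len(pairs)


def regularize(z_init, mu, sigma=1.0, alpha_bar=0.5, lr=0.1, steps=200, seed=0):
    """Gradient descent driven only by the diffusion-style regularizer (no data term)."""
    rng = random.Random(seed)
    z = list(z_init)
    for _ in range(steps):
        grad = diffusion_reg_grad(z, alpha_bar, mu, sigma, rng)
        z = [zi - lr * g for zi, g in zip(z, grad)]
    return z
```

With this toy prior, calling `regularize([5.0] * 8, mu=[0.0] * 8)` pulls a far-off latent back towards the prior mean, which mirrors the caption's idea of guiding codes towards the manifold of their distribution. In an actual fit this gradient would be combined with the gradients of the data terms $L_{sdf}$ and $L_{norm}$ rather than used alone.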