DPHMs: Diffusion Parametric Head Models for Depth-based Tracking
Jiapeng Tang, Angela Dai, Yinyu Nie, Lev Markhasin, Justus Thies, Matthias Niessner
TL;DR
This paper tackles robust reconstruction and tracking of 3D head geometry from monocular depth sequences, a task made difficult by partial observations and sensor noise. It introduces Diffusion Parametric Head Models (DPHMs), which couple Neural Parametric Head Models (NPHMs) with diffusion-based priors that keep the identity and expression latents on plausible manifolds during test-time optimization. The work contributes three components: a backward deformation-based expression space, a two-part latent diffusion model (identity and expression), and diffusion-based regularizers that improve temporal coherence and geometric plausibility. It also introduces the DPHM-Kinect dataset, which features challenging expressive motion. Experiments show more accurate head geometry and more robust expression tracking than state-of-the-art methods, and ablation studies validate each design choice. The approach makes high-fidelity head avatars more accessible from consumer depth sensors, with potential benefits for AR/VR, telepresence, and digital twins.
Abstract
We introduce Diffusion Parametric Head Models (DPHMs), a generative model that enables robust volumetric head reconstruction and tracking from monocular depth sequences. While recent volumetric head models, such as NPHMs, now excel at representing high-fidelity head geometries, tracking and reconstructing heads from real-world single-view depth sequences remains challenging, because fitting to partial and noisy observations is underconstrained. To tackle these challenges, we propose a latent diffusion-based prior to regularize volumetric head reconstruction and tracking. This prior-based regularizer constrains the identity and expression codes to lie on the underlying latent manifold that represents plausible head shapes. To evaluate the effectiveness of the diffusion-based prior, we collect a dataset of monocular Kinect sequences consisting of various complex facial expression motions and rapid transitions. We compare our method to state-of-the-art tracking methods and demonstrate improved head identity reconstruction as well as robust expression tracking.
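To make the idea of a diffusion prior regularizing an underconstrained fit concrete, here is a minimal toy sketch in Python. It is not the authors' implementation. It assumes a Gaussian latent prior, for which the optimal noise prediction has a closed form, in place of a trained latent diffusion network. It also assumes a score-distillation-style gradient (predicted noise minus injected noise) as the regularizer, and it averages that gradient over an antithetic noise pair to reduce variance. All names (`toy_noise_predictor`, `fit_latent`, `lam`, `sigma`) are hypothetical.

```python
import numpy as np

def toy_noise_predictor(z_t, sigma, mu, s0):
    # Stand-in for a trained denoiser (assumption): for a Gaussian prior
    # N(mu, s0^2 I) and z_t = z + sigma * eps, the optimal noise
    # prediction E[eps | z_t] is available in closed form.
    return sigma * (z_t - mu) / (s0**2 + sigma**2)

def fit_latent(obs, mask, mu, s0, lam=0.5, lr=0.1, steps=500, sigma=0.5, seed=0):
    # Fit a latent code to partial observations with a diffusion-style prior.
    rng = np.random.default_rng(seed)
    z = np.zeros_like(mu)
    for _ in range(steps):
        # Data term: gradient of 0.5 * ||mask * (z - obs)||^2.
        # Unobserved entries (mask == 0) receive no data gradient.
        grad_data = mask * (z - obs)

        # Prior term: perturb the current code, ask the denoiser which noise
        # was added, and use (predicted noise - true noise) as the gradient.
        # Averaging over +eps and -eps is a variance-reduction choice.
        eps = rng.standard_normal(z.shape)
        g_pos = toy_noise_predictor(z + sigma * eps, sigma, mu, s0) - eps
        g_neg = toy_noise_predictor(z - sigma * eps, sigma, mu, s0) + eps
        grad_prior = 0.5 * (g_pos + g_neg)

        z -= lr * (grad_data + lam * grad_prior)
    return z

if __name__ == "__main__":
    mu = np.array([2.0, -1.0, 0.5])    # toy prior mean (hypothetical)
    obs = np.array([3.0, 0.0, 0.0])    # only the first entry is observed
    mask = np.array([1.0, 0.0, 0.0])
    print(fit_latent(obs, mask, mu, s0=1.0))
```

In this toy setting the unobserved entries, which the data term cannot constrain, are pulled toward the prior mean, while the observed entry settles between its measurement and the prior. This mirrors, in a much simplified form, how a prior can keep identity and expression codes plausible when depth observations are partial and noisy.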
