Refined Geometry-guided Head Avatar Reconstruction from Monocular RGB Video
Pilseo Park, Ze Zhang, Michel Sarkis, Ning Bi, Xiaoming Liu, Yiying Tong
TL;DR
This paper tackles high-fidelity head avatar reconstruction from monocular RGB video with a two-phase framework. The first phase trains a 3DMM-informed NeRF using an initial FLAME/DECA mesh and latent vertex codes. The geometry is then refined by building an SDF from the NeRF density and displacing the mesh along its vertex normals, with Laplacian smoothing applied to the displacement. A second-phase NeRF is trained with the refined mesh to capture fine facial details, guided by a geometry-aware latent field. Quantitative and qualitative results across six subjects and multiple datasets show state-of-the-art or near-state-of-the-art performance in L$_1$, PSNR, SSIM, and LPIPS, with clear improvements in eye and mouth regions and expression fidelity. The work demonstrates that incorporating NeRF-derived geometry through SDF-based mesh refinement can significantly enhance photorealistic head rendering from monocular input, with practical implications for VR/AR and telepresence. It also highlights avenues for future work, including applying the bootstrapped geometry framework to other 3DMMs and broader scenes.
Abstract
High-fidelity reconstruction of head avatars from monocular videos is highly desirable for virtual human applications, but it remains a challenge in computer graphics and computer vision. In this paper, we propose a two-phase head avatar reconstruction network that incorporates a refined 3D mesh representation. In contrast to existing methods that rely on coarse template-based 3D representations derived from a 3DMM, our approach aims to learn a refined mesh representation suitable for a NeRF that captures complex facial nuances. In the first phase, we train a 3DMM-stored NeRF with an initial mesh to utilize geometric priors and integrate observations across frames using a consistent set of latent codes. In the second phase, we leverage a novel mesh refinement procedure based on an SDF constructed from the density field of the initial NeRF. To mitigate the typical noise in the NeRF density field without compromising the features of the 3DMM, we apply Laplacian smoothing to the displacement field. Subsequently, we run a second-phase training with these refined meshes, directing the network towards capturing intricate facial details. Our experiments demonstrate that our method improves on the NeRF rendering obtained with the initial mesh and outperforms state-of-the-art methods in reconstructing high-fidelity head avatars from monocular video input.
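
To make the displacement-smoothing step concrete, here is a minimal sketch in Python with NumPy. It is not the authors' implementation. It assumes a scalar per-vertex offset along the vertex normal (for example, one derived from SDF values sampled at the mesh vertices), uniform "umbrella" Laplacian weights, and a dense adjacency matrix that is only practical for small meshes. The function names and the parameters `lam` and `iterations` are hypothetical and chosen for illustration only.

```python
import numpy as np


def vertex_adjacency(faces, num_vertices):
    """Dense 0/1 vertex adjacency built from triangle faces (small meshes only)."""
    adj = np.zeros((num_vertices, num_vertices), dtype=float)
    for a, b, c in faces:
        for i, j in ((a, b), (b, c), (c, a)):
            adj[i, j] = 1.0
            adj[j, i] = 1.0
    return adj


def smooth_displacement(disp, adj, lam=0.5, iterations=10):
    """Uniform Laplacian smoothing of a scalar per-vertex displacement.

    Each iteration blends a vertex's value with the mean of its neighbours.
    `lam` and `iterations` are illustrative assumptions, not values from the paper.
    """
    deg = adj.sum(axis=1)
    d = np.asarray(disp, dtype=float).copy()
    for _ in range(iterations):
        # Isolated vertices (degree 0) keep their own value.
        neighbour_mean = np.where(deg > 0, adj @ d / np.maximum(deg, 1.0), d)
        d = (1.0 - lam) * d + lam * neighbour_mean
    return d


def refine_vertices(vertices, normals, disp):
    """Move each vertex along its unit normal by the (smoothed) displacement."""
    return vertices + disp[:, None] * normals


# Toy example: a tetrahedron with a noisy displacement spike on one vertex.
vertices = np.array([[0.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0],
                     [0.0, 0.0, 1.0]])
faces = np.array([[0, 1, 2], [0, 1, 3], [0, 2, 3], [1, 2, 3]])

# Crude placeholder normals: unit vectors pointing away from the centroid.
normals = vertices - vertices.mean(axis=0)
normals /= np.linalg.norm(normals, axis=1, keepdims=True)

raw_disp = np.array([1.0, 0.0, 0.0, 0.0])
adj = vertex_adjacency(faces, len(vertices))
smooth_disp = smooth_displacement(raw_disp, adj)
refined = refine_vertices(vertices, normals, smooth_disp)
```

Smoothing the displacement rather than the vertex positions is the point of the design: the base mesh keeps its 3DMM shape features, and only the noisy correction is regularized. A practical version would likely use sparse matrices and possibly cotangent weights, which this sketch leaves out.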
