Refined Geometry-guided Head Avatar Reconstruction from Monocular RGB Video

Pilseo Park, Ze Zhang, Michel Sarkis, Ning Bi, Xiaoming Liu, Yiying Tong

TL;DR

This paper tackles high-fidelity head avatar reconstruction from monocular RGB video by introducing a two-phase framework that first trains a 3DMM-informed NeRF using an initial FLAME/DECA mesh and latent vertex codes, then refines the geometry by building an SDF from the NeRF density and perturbing the mesh along normals with Laplacian smoothing on the displacement. A second-phase NeRF is then trained with the refined mesh to capture fine facial details, guided by a geometry-aware latent field. Quantitative and qualitative results across six subjects and multiple datasets show state-of-the-art or near-state-of-the-art performance in L$_1$, PSNR, SSIM, and LPIPS, with clear improvements in eye and mouth regions and expression fidelity. The work demonstrates that incorporating NeRF-derived geometry through SDF-based mesh refinement can significantly enhance photorealistic head rendering from monocular input, with practical implications for VR/AR and telepresence. It also highlights avenues for future work, including applying the bootstrapped geometry framework to other 3DMMs and broader scenes.

Abstract

High-fidelity reconstruction of head avatars from monocular videos is highly desirable for virtual human applications, but it remains a challenge in the fields of computer graphics and computer vision. In this paper, we propose a two-phase head avatar reconstruction network that incorporates a refined 3D mesh representation. Our approach, in contrast to existing methods that rely on coarse template-based 3D representations derived from 3DMM, aims to learn a refined mesh representation suitable for a NeRF that captures complex facial nuances. In the first phase, we train a 3DMM-stored NeRF with an initial mesh to utilize geometric priors and integrate observations across frames using a consistent set of latent codes. In the second phase, we leverage a novel mesh refinement procedure based on an SDF constructed from the density field of the initial NeRF. To mitigate the typical noise in the NeRF density field without compromising the features of the 3DMM, we employ Laplacian smoothing on the displacement field. Subsequently, we apply a second-phase training with these refined meshes, directing the learning process of the network towards capturing intricate facial details. Our experiments demonstrate that our method further enhances the NeRF rendering based on the initial mesh and achieves performance superior to state-of-the-art methods in reconstructing high-fidelity head avatars with such input.
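
As a rough illustration of the refinement idea in the abstract (moving each mesh vertex along its normal to the surface implied by the NeRF density), the following is a minimal sketch and not the authors' implementation. The callable `density_fn`, the search radius, the sample count `M`, and the density threshold `tau` are all hypothetical choices made for this example; the paper's actual SDF construction may differ.

```python
import numpy as np

def displacement_along_normal(density_fn, v, n_hat, radius=0.05, M=64, tau=10.0):
    """Signed offset S along n_hat from v to the first density level crossing.

    density_fn, radius, M and tau are illustrative assumptions,
    not values taken from the paper.
    """
    n_hat = n_hat / np.linalg.norm(n_hat)
    # Sample from outside (+radius) to inside (-radius) along the normal.
    t = np.linspace(radius, -radius, M)
    pts = v[None, :] + t[:, None] * n_hat[None, :]
    sigma = density_fn(pts)                 # densities, shape (M,)
    inside = sigma >= tau
    if not inside.any() or inside[0]:
        return 0.0                          # no reliable crossing: keep the vertex
    k = int(np.argmax(inside))              # first sample at or above the threshold
    # Linear interpolation between the last outside and first inside sample.
    w = (tau - sigma[k - 1]) / (sigma[k] - sigma[k - 1])
    return float(t[k - 1] + w * (t[k] - t[k - 1]))

def perturb(v, n_hat, S):
    """Move vertex v by S along the unit normal to obtain v'."""
    return v + S * n_hat / np.linalg.norm(n_hat)

# Toy density: a soft unit sphere whose density equals tau=10 at radius 1.
def toy_density(p):
    r = np.linalg.norm(p, axis=1)
    return np.clip(100.0 * (1.1 - r), 0.0, None)

v = np.array([0.98, 0.0, 0.0])
n = np.array([1.0, 0.0, 0.0])
S = displacement_along_normal(toy_density, v, n)
v_new = perturb(v, n, S)
```

In this toy setup the vertex starts slightly inside the sphere and is pushed outward to the density level set. Returning zero when no crossing is found is one conservative way to fall back on the 3DMM prior; it is a design choice of this sketch only.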

Paper Structure

This paper contains 23 sections, 11 equations, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Given a monocular RGB video, we train a 3DMM-stored NeRF to generate a refined mesh, which is subsequently used to render a photorealistic head avatar. In contrast to relying on coarse 3DMM template-based geometric representations, our approach captures individualized details, facilitating improved NeRF learning towards refined geometry.
  • Figure 2: Framework overview. In phase one, we obtain one initial template mesh per frame from a pre-trained DECA. We then attach latent codes on mesh vertices to represent the local appearance and geometry of the head, and train the NeRF. In phase two, we construct an SDF from the trained NeRF model and use its density field to refine the mesh through denoised perturbation. Subsequently, we apply second-phase NeRF training and employ volume rendering to generate the final head avatar.
  • Figure 3: Denoised mesh perturbation. (a) From $M$ sample points along the ray, we identify the displacement amount $\mathbf{S}$, which represents the distance to the surface from the initial vertex $v$. We then perturb $v$ in the direction of the normal $\mathbf{\hat{n}}$ using $\mathbf{S}$ to obtain $v'$. (b) To apply displacement-only smoothing, we first obtain the displacement vector field $\mathbf{D}$. For vertex $i$ and its neighbor $j$, $\alpha_{ij}$ and $\delta_{ij}$ represent the angles opposite the edge between them, which are used to assemble the geometric Laplacian matrix $\mathbf{L}$.
  • Figure 4: Qualitative comparisons with SOTA methods. From top left to bottom right, the panels show subjects 1, 2, 3, 4, 5, and 6, respectively. Our method produces better reconstructions and captures detailed expressions and facial structures, particularly in the eye and mouth regions.
  • Figure 5: (a) Each row shows a different frame. 'Per.' denotes mesh perturbation and 'Smo.' denotes Laplacian smoothing. Applying both enhances the depiction of subtle facial movements. (b) The same abbreviations as in (a) are used. The model with neither perturbation nor smoothing struggles to capture facial details accurately and shows artifacts around the eye region, while applying the proposed method enhances the depiction of subtle facial movements.
  • ...and 1 more figure
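
The displacement-only smoothing described in the Figure 3 caption can be sketched as below. This is a hedged illustration under stated assumptions: the Laplacian $\mathbf{L}$ is assembled with the standard cotangent weights from the two angles opposite each edge, and the smoothing is done with a single implicit step $(\mathbf{I} + \lambda \mathbf{L})\,\mathbf{D}' = \mathbf{D}$. The implicit-step form, the weight `lam`, and the dense matrix are assumptions of this example; the paper's exact formulation is not given in the text above.

```python
import numpy as np

def cotangent_laplacian(V, F):
    """Dense cotangent Laplacian L for vertices V (n, 3) and triangles F (m, 3).

    Dense storage is only suitable for small meshes; a sparse matrix
    would be needed for a real head mesh.
    """
    n = V.shape[0]
    L = np.zeros((n, n))
    for tri in F:
        for c in range(3):
            i, j, k = tri[c], tri[(c + 1) % 3], tri[(c + 2) % 3]
            # The angle at vertex k is opposite the edge (i, j).
            a, b = V[i] - V[k], V[j] - V[k]
            cot = np.dot(a, b) / np.linalg.norm(np.cross(a, b))
            w = 0.5 * cot
            L[i, j] -= w
            L[j, i] -= w
            L[i, i] += w
            L[j, j] += w
    return L

def smooth_displacement(V, F, V_perturbed, lam=1.0):
    """Smooth only the displacement field D = V_perturbed - V, then re-apply it.

    lam is a hypothetical smoothing weight chosen for illustration.
    """
    D = V_perturbed - V
    L = cotangent_laplacian(V, F)
    D_smooth = np.linalg.solve(np.eye(len(V)) + lam * L, D)
    return V + D_smooth

# Toy mesh: a regular tetrahedron with a noisy spike on one vertex.
V = np.array([[1, 1, 1], [1, -1, -1], [-1, 1, -1], [-1, -1, 1]], dtype=float)
F = np.array([[0, 1, 2], [0, 3, 1], [0, 2, 3], [1, 3, 2]])
spike = V.copy()
spike[0, 2] += 0.1
refined = smooth_displacement(V, F, spike)
```

Because only $\mathbf{D}$ is smoothed and then added back to the original vertices, the shape of the initial mesh is left intact: a zero displacement stays zero, and a uniform displacement passes through unchanged.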