Neural Point-based Volumetric Avatar: Surface-guided Neural Points for Efficient and Photorealistic Volumetric Head Avatar

Cong Wang, Di Kang, Yan-Pei Cao, Linchao Bao, Ying Shan, Song-Hai Zhang

TL;DR

This paper proposes Neural Point-based Volumetric Avatar (NPVA), a method that adopts the neural point representation as well as the neural volume rendering process and discards the predefined connectivity and hard correspondence imposed by mesh-based approaches.

Abstract

Rendering photorealistic and dynamically moving human heads is crucial for ensuring a pleasant and immersive experience in AR/VR and video conferencing applications. However, existing methods often struggle to model challenging facial regions (e.g., mouth interior, eyes, hair/beard), resulting in unrealistic and blurry results. In this paper, we propose Neural Point-based Volumetric Avatar (NPVA), a method that adopts the neural point representation as well as the neural volume rendering process and discards the predefined connectivity and hard correspondence imposed by mesh-based approaches. Specifically, the neural points are strategically constrained around the surface of the target expression via a high-resolution UV displacement map, achieving increased modeling capacity and more accurate control. We introduce three technical innovations to improve the rendering and training efficiency: a patch-wise depth-guided (shading point) sampling strategy, a lightweight radiance decoding process, and a Grid-Error-Patch (GEP) ray sampling strategy during training. By design, our NPVA is better equipped to handle topologically changing regions and thin structures while also ensuring accurate expression control when animating avatars. Experiments conducted on three subjects from the Multiface dataset demonstrate the effectiveness of our designs, outperforming previous state-of-the-art methods, especially in handling challenging facial regions.
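
As a rough illustration of the surface-guided point placement described in the abstract, the sketch below builds neural point positions from a low-resolution position map plus a high-resolution displacement map. It is not the authors' implementation: the function name `surface_guided_points`, the nearest-neighbor upsampling, the `tanh` bounding, and the `max_offset` value are all assumptions chosen for illustration.

```python
import math


def surface_guided_points(position_map, displacement_map, max_offset=0.01):
    """Place one point per high-resolution texel near a coarse surface.

    position_map:     h x w nested list of (x, y, z) surface positions.
    displacement_map: H x W nested list of raw (dx, dy, dz) values, where
                      H and W are assumed to be integer multiples of h and w.
    max_offset:       hypothetical bound on how far a point may move
                      from the surface along each axis.
    """
    h, w = len(position_map), len(position_map[0])
    H, W = len(displacement_map), len(displacement_map[0])
    scale_v, scale_u = H // h, W // w

    points = []
    for v in range(H):
        for u in range(W):
            # Nearest-neighbor upsampling of the coarse surface (assumption).
            base = position_map[v // scale_v][u // scale_u]
            raw = displacement_map[v][u]
            # tanh keeps every displacement inside [-max_offset, max_offset],
            # so points stay in a thin shell around the surface (assumption).
            points.append(
                tuple(b + max_offset * math.tanh(r) for b, r in zip(base, raw))
            )
    return points
```

Bounding the displacement is one simple way to keep points tied to the surface while still letting them spread out where more capacity is needed; the paper may constrain the range differently.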

Paper Structure (31 sections, 7 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: Overview of Neural Point-based Volumetric Avatar (NPVA). The core of our approach is a neural point-based volumetric representation (middle), with points distributed around the surface of the target expression. This surface is defined by the low-resolution position map $\hat{\bm{G}}_o$ with intermediate supervision. A high-resolution displacement map $\hat{\bm{G}}_d$ allows the points to adaptively move within a certain range, as needed to provide increased capacity in more challenging regions (e.g., mouth, hair/beard). The attached point features are obtained from the feature map $\hat{\bm{F}}$. $\hat{\bm{G}}_o$, $\hat{\bm{G}}_d$, and $\hat{\bm{F}}$ are decoded from the latent code $\bm{z}$ (left), which is trained in a variational auto-encoding style (encoder omitted). In addition, we propose three technical innovations with the aim of achieving rendering efficiency on par with mesh-based methods (right).
  • Figure 2: Lightweight Radiance Decoding. Given a query point, we find its $K$ nearest neighboring neural points. We compute a weighted sum of their features to obtain an "average" feature for the subsequent radiance decoding. Note that we remove the per-point processing MLP used in Point-NeRF. This lightweight decoding process runs faster and generalizes better to novel expressions in our dynamic modeling task.
  • Figure 3: Qualitative Comparisons with State-of-the-Art Methods. Our NPVA produces more photorealistic facial renditions on held-out test expressions compared to previous state-of-the-art methods, particularly in challenging facial regions (i.e., eyes, beard, and mouth interior). The normal expression is presented in Row 1, while the extreme expressions are shown in Row 2. The bottom left corners of the leftmost images show the shell thickness of our NPVA. The thickness for a specified face is the variance of point-to-surface distance for the points corresponding to it. Red (blue) indicates larger (smaller) variance. During learning, our NPVA automatically increases capacity (i.e., thicker shell) to better model the more challenging facial regions (e.g., hair and mouth interior).
  • Figure 4: Effect of lightweight radiance decoding. Our lightweight radiance decoding not only achieves a $\sim 7 \times$ speedup over Point-NeRF's radiance decoding, but also produces better facial renditions for unseen expressions.
  • Figure 5: Effect of different depth-guided shading point sampling strategies. There are two depth levels around the orange mark. With pixel-wise depth sampling [DBLP:conf/siggrapha/LinPXYSBZ22], the model generates "mesh-like" artifacts. In contrast, our patch-wise depth sampling samples around both the front and back depth levels, leading to more realistic results.
  • ...and 4 more figures
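
The feature aggregation step described in the Figure 2 caption (find the $K$ nearest neural points, then take a weighted sum of their features) can be sketched as follows. This is a minimal illustration rather than the paper's code: the function name `aggregate_features`, the inverse-distance weighting, and the `eps` constant are assumptions, and the radiance decoder that would consume the aggregated feature is not shown.

```python
import math


def aggregate_features(query, points, features, k=8, eps=1e-8):
    """Return a weighted average of the features of the k nearest points.

    query:    (x, y, z) position of the shading point.
    points:   list of (x, y, z) neural point positions.
    features: list of per-point feature vectors (equal-length lists).
    k:        number of neighbors to aggregate.
    eps:      small constant that avoids division by zero (assumption).
    """
    # Brute-force k-nearest-neighbor search; a real renderer would
    # likely use a spatial acceleration structure instead.
    nearest = sorted(
        (math.dist(query, p), i) for i, p in enumerate(points)
    )[:k]

    # Inverse-distance weights, normalized to sum to 1 (assumption:
    # the caption only states that a weighted sum is used).
    weights = [1.0 / (d + eps) for d, _ in nearest]
    total = sum(weights)

    dim = len(features[0])
    aggregated = [0.0] * dim
    for w, (_, i) in zip(weights, nearest):
        for c in range(dim):
            aggregated[c] += (w / total) * features[i][c]
    return aggregated
```

Because no per-point MLP runs before the sum, the cost per query is dominated by the neighbor search and a single decode of the aggregated feature, which is consistent with the speedup the captions attribute to removing Point-NeRF's per-point processing.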