EG-HumanNeRF: Efficient Generalizable Human NeRF Utilizing Human Prior for Sparse View

Zhaorong Wang, Yoshihiro Kanamori, Yuki Endo

TL;DR

This work proposes an occlusion-aware attention mechanism to extract occlusion information from the human priors, followed by an image space refinement network to improve rendering quality, and adopts a signed ray distance function (SRDF) formulation for volume rendering.

Abstract

Generalizable neural radiance field (NeRF) enables neural-based digital human rendering without per-scene retraining. When combined with human prior knowledge, high-quality human rendering can be achieved even with sparse input views. However, the inference of these methods is still slow, as a large number of neural network queries on each ray are required to ensure the rendering quality. Moreover, occluded regions often suffer from artifacts, especially when the input views are sparse. To address these issues, we propose a generalizable human NeRF framework that achieves high-quality and real-time rendering with sparse input views by extensively leveraging human prior knowledge. We accelerate the rendering with a two-stage sampling reduction strategy: first constructing boundary meshes around the human geometry to reduce the number of ray samples for sampling guidance regression, and then volume rendering using fewer guided samples. To improve rendering quality, especially in occluded regions, we propose an occlusion-aware attention mechanism to extract occlusion information from the human priors, followed by an image space refinement network to improve rendering quality. Furthermore, for volume rendering, we adopt a signed ray distance function (SRDF) formulation, which allows us to propose an SRDF loss at every sample position to improve the rendering quality further. Our experiments demonstrate that our method outperforms the state-of-the-art methods in rendering quality and has a competitive rendering speed compared with speed-prioritized novel view synthesis methods.
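The SRDF idea in the abstract can be illustrated with a small sketch. It assumes the common definition of a signed ray distance function: at a sample with depth `t` along a ray, the SRDF is the signed distance along that ray to the surface, i.e. `surface_depth - t`. The function names, the L1 form of the loss, and the NeuS-style logistic conversion from SRDF values to rendering weights are illustrative assumptions, not the formulation confirmed by the paper.

```python
import math


def _sigmoid(x):
    """Numerically stable logistic function."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def target_srdf(sample_depths, surface_depth):
    """Signed distance along the ray from each sample to the surface.

    Positive in front of the surface, negative behind it.
    """
    return [surface_depth - t for t in sample_depths]


def srdf_l1_loss(predicted, sample_depths, surface_depth):
    """Mean absolute error between predicted and target SRDF at every sample.

    The L1 form is an assumption made for this sketch.
    """
    target = target_srdf(sample_depths, surface_depth)
    return sum(abs(p - g) for p, g in zip(predicted, target)) / len(target)


def srdf_to_weights(srdf_values, sharpness=50.0):
    """Convert SRDF values at consecutive samples into rendering weights.

    Assumed NeuS-style rule: the opacity of an interval is the relative drop
    of sigmoid(sharpness * srdf) across it, so weight concentrates where the
    SRDF changes sign. Returns one weight per interval between samples.
    """
    cdf = [_sigmoid(sharpness * s) for s in srdf_values]
    weights, transmittance = [], 1.0
    for i in range(len(cdf) - 1):
        alpha = max((cdf[i] - cdf[i + 1]) / (cdf[i] + 1e-8), 0.0)
        weights.append(transmittance * alpha)
        transmittance *= 1.0 - alpha
    return weights


# Five samples on one ray; the surface sits at depth 1.0.
depths = [0.8, 0.9, 1.0, 1.1, 1.2]
srdf = target_srdf(depths, surface_depth=1.0)
weights = srdf_to_weights(srdf)
loss = srdf_l1_loss(srdf, depths, surface_depth=1.0)
```

Because every sample has a well-defined target (`surface_depth - t`), supervision can be applied at each sample position rather than only on the final rendered depth, which is the property the abstract attributes to the SRDF formulation.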

Paper Structure

This paper contains 27 sections, 12 equations, 9 figures, 10 tables.

Figures (9)

  • Figure 1: Our method enables efficient and high-quality rendering of generalizable human NeRF from sparse input views. Our method outperforms the state-of-the-art quality-prioritized methods, e.g., KeypointNeRF, and has competitive rendering speed with the fastest speed-prioritized methods, e.g., GPS-Gaussian. Our method also removes artifacts caused by occlusion in the input views, often observed in state-of-the-art methods. As shown in the figure, existing methods yield blurriness and missing body parts (red boxes) due to using sparse input views, and finger-like artifacts near the human's right hand (blue boxes) due to occlusion. In contrast, our method renders high-quality images without artifacts. The metric score and rendering speed in each figure are averaged values over the test set in our experiments.
  • Figure 2: Overview of our method. Given calibrated sparse multi-view images, we encode their feature maps and (1) construct a geometry feature volume from the SMPL-X mesh and the input views to provide human prior knowledge. Utilizing the boundary meshes derived from the SMPL-X mesh, we (2) use a two-stage sampling strategy to reduce the number of samples required and accelerate the rendering while maintaining rendering quality. At each sample position, we (3) regress geometry-related values and appearance features used for rendering, along with an optional occlusion-aware feature to improve rendering quality in occluded regions. We aggregate features on sample positions using an SRDF-based formulation. Finally, we (4) conduct image space refinement from feature maps to synthesize the image in the target view. Our method (5) is trained end-to-end, with optional SRDF and adversarial losses to improve the rendering quality further.
  • Figure 3: Comparisons between inpainting and image space refinement. "w/o refinement" denotes the rendering results without using image space refinement. "w/ inpainting" denotes rendering results with a post hoc inpainting method (LaMa). "w/ refinement" denotes our rendering results with image space refinement using occlusion-aware features. When the input views already provide sufficient information to hallucinate the occluded regions, the inpainting method destroys the original information and is not able to recover complex textures without using a heavyweight model. In contrast, the image space refinement method focuses on the occluded regions, preserving the original information and avoiding the need for heavyweight generative models.
  • Figure 4: Qualitative evaluation results on the THuman2.0 dataset (first three rows) and the ZJU-MoCap dataset (last row) compared with the state-of-the-art methods. We use four input views for the THuman2.0 dataset and three input views for the ZJU-MoCap dataset. For GPS-Gaussian, we use six input views (instead of four) and ground-truth depth, as six is the minimum number of views supported by this method.
  • Figure 5: Qualitative results of the ablation studies. "w/o BM" denotes the results without using the boundary meshes. "w/ DB" denotes the results with the density-based rendering. "w/o depth" denotes the results without using the depth loss. "w/ depth" denotes the results with the depth loss only. "w/ SRDF" denotes the results with the depth loss and the SRDF loss. As shown in the red boxes in the first row, results from models with missing components exhibit artifacts on the neck or an unnatural face. The results from the model without the SRDF loss exhibit slight artifacts on the pants, as shown in the blue boxes in the first row. In the second row, all models except the one using the SRDF loss exhibit varying degrees of artifacts on the face and one hand.
  • ...and 4 more figures
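
The two-stage sampling strategy described in the abstract and in step (2) of Figure 2 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the ray's entry and exit depths through the boundary meshes (`t_near`, `t_far`) are taken as given, the sampling guidance is modeled as a single estimated surface depth, and `regress_guidance` is a hypothetical stand-in for the network that regresses it. The sample counts and window size are arbitrary illustrative values.

```python
def uniform_samples(t_near, t_far, n):
    """Return n depths at the centres of n equal bins in [t_near, t_far]."""
    step = (t_far - t_near) / n
    return [t_near + (i + 0.5) * step for i in range(n)]


def guided_samples(t_guess, half_width, n, t_near, t_far):
    """Place n samples in a narrow window around the guidance depth.

    The window is clamped to the interval bounded by the boundary meshes.
    """
    lo = max(t_near, t_guess - half_width)
    hi = min(t_far, t_guess + half_width)
    return uniform_samples(lo, hi, n)


def two_stage_sample(t_near, t_far, regress_guidance,
                     n_coarse=8, n_fine=4, half_width=0.05):
    """Stage 1: a few samples between the boundary surfaces feed the
    guidance regressor. Stage 2: even fewer samples are placed around the
    regressed depth and are the ones used for volume rendering."""
    coarse = uniform_samples(t_near, t_far, n_coarse)
    t_guess = regress_guidance(coarse)  # hypothetical network call
    return guided_samples(t_guess, half_width, n_fine, t_near, t_far)


# Usage with a dummy regressor that always predicts a surface at depth 1.02.
fine = two_stage_sample(0.9, 1.3, regress_guidance=lambda coarse: 1.02)
```

Restricting stage 1 to the interval between the boundary surfaces avoids spending samples on empty space, and stage 2 only evaluates the rendering network at the few guided positions, which is where the speed-up described in the abstract would come from.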