Efficient Neural Implicit Representation for 3D Human Reconstruction

Zexu Huang, Sarah Monazam Erfani, Siying Lu, Mingming Gong

TL;DR

This study presents HumanAvatar, an approach that efficiently reconstructs accurate human avatars from monocular video, using the pre-trained human motion estimation model HuMoR to improve reconstruction fidelity and speed.

Abstract

High-fidelity digital human representations are increasingly in demand, particularly for interactive telepresence, AR/VR, 3D graphics, and the rapidly evolving metaverse. Although they work well in small spaces, conventional methods for reconstructing 3D human motion frequently require expensive hardware and incur high processing costs. This study presents HumanAvatar, an approach that efficiently reconstructs precise human avatars from monocular video. At the core of our methodology is the pre-trained HuMoR, a model for human motion estimation. We combine it with the neural radiance field technique Instant-NGP and the articulated model Fast-SNARF to improve reconstruction fidelity and speed. Together, these components form a system that renders quickly while also estimating human pose parameters with unmatched accuracy. We further add a posture-sensitive space reduction technique, which balances rendering quality against computational cost. In detailed experiments on both synthetic and real-world monocular videos, HumanAvatar consistently equals or surpasses contemporary state-of-the-art (SoTA) reconstruction techniques in quality. It also completes these reconstructions in minutes, a fraction of the time typically required by existing methods. Our models train 110x faster than SoTA NeRF-based models. Given an identical runtime budget, our technique performs noticeably better than SoTA dynamic human NeRF methods. HumanAvatar can provide effective visuals after only 30 seconds of training.

Paper Structure

This paper contains 21 sections, 18 equations, 10 figures, 3 tables.

Figures (10)

  • Figure 1: HumanAvatar: We present a framework that produces computationally realistic human avatar representations from single-camera video footage, incorporating both poses and facial features. After reconstruction, the avatar can be animated and displayed at a rate of 15 frames per second with a resolution of 540x540 pixels. To accomplish our goals, we combine a pre-trained model specialized in human motion with efficient neural radiance fields, which were initially created for static environments. We also incorporate a rapid articulation correspondence search mechanism. By leveraging an established technique for skipping empty spaces, we enhance both the training and inference speeds, making it possible to learn avatar dynamics within minutes.
  • Figure 2: Model Structure: We estimate the SMPL parameters for each frame using HuMoR and sample points along the rays in pose space. The global orientation and translation are then subtracted from these points' locations to place them in a normalised space, after which our occupancy vector is used to filter out points in unoccupied space. We use the SMPL body model to optimize the human avatar in the canonical space. To evaluate colour and density, an articulation module warps these points into the canonical space, where they are passed to a canonical neural radiance field.
  • Figure 3: Posture-sensitive Space Reduction Procedure. In the inference stage, our method skips redundant sampling.
  • Figure 4: SMPL estimation results of HuMoR [rempe2021humor] (top) and ROMP [sun2021monocular] (bottom).
  • Figure 5: SMPL estimation results for HuMoR [rempe2021humor] (top) and ROMP [sun2021monocular] (bottom) in partially occluded scenes.
  • ...and 5 more figures
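
The normalisation and empty-space filtering steps described in the Figure 2 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the dense boolean occupancy grid, the bounding box, and the reading of "subtracting" the global orientation as applying the inverse rotation are all assumptions made for illustration.

```python
import numpy as np

def normalise_points(points, global_rot, global_trans):
    """Remove the global translation and orientation from pose-space points.

    points:       (N, 3) sample positions along camera rays (pose space)
    global_rot:   (3, 3) rotation matrix for the body's global orientation
    global_trans: (3,)   global translation
    Assumption: normalised x = R^T (x - t), written here for row vectors.
    """
    return (np.asarray(points) - np.asarray(global_trans)) @ np.asarray(global_rot)

def filter_by_occupancy(points, occupancy, bbox_min, bbox_max):
    """Return a boolean mask of points that fall in occupied grid cells.

    occupancy: (R, R, R) boolean grid over the box [bbox_min, bbox_max]
    (hypothetical layout; the paper only refers to an "occupancy vector").
    Points outside the box are treated as empty space.
    """
    bbox_min = np.asarray(bbox_min, dtype=float)
    bbox_max = np.asarray(bbox_max, dtype=float)
    res = np.array(occupancy.shape)

    # Convert continuous coordinates to integer cell indices.
    scaled = (points - bbox_min) / (bbox_max - bbox_min) * res
    idx = np.floor(scaled).astype(int)

    # Only look up cells that lie inside the grid.
    inside = np.all((idx >= 0) & (idx < res), axis=1)
    keep = np.zeros(len(points), dtype=bool)
    ii = idx[inside]
    keep[inside] = occupancy[ii[:, 0], ii[:, 1], ii[:, 2]]
    return keep
```

Only the points where the mask is true would then be warped to the canonical space and evaluated by the radiance field, which is where the savings from skipping empty space come from.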