HINT: Learning Complete Human Neural Representations from Limited Viewpoints

Alessandro Sanvito, Andrea Ramazzina, Stefanie Walz, Mario Bijelic, Felix Heide

TL;DR

HINT addresses the challenge of reconstructing complete human avatars from limited viewpoints by splitting the scene into a background NeRF and a canonical-space, SDF-based human model, guided by a sagittal-plane symmetry prior and supervised by depth and segmentation cues. A co-trained Human Digitization Network (HDN) provides priors for unseen views, with targeted losses (including a novel SDF-based supervision) that prevent geometry collapse and promote realistic textures. Quantitative results show substantial gains over prior methods, with PSNR improvements of around 15% and LPIPS reductions of about 34%, demonstrating robust novel-view synthesis from sparse data. The approach enables complete, animatable human representations in real-world, limited-view robotics scenarios, facilitating data augmentation, counterfactual generation, and safer autonomous operation in dynamic environments.

Abstract

No augmented application is possible without animated humanoid avatars. At the same time, generating human replicas from real-world monocular hand-held or robotic sensor setups is challenging due to the limited availability of views. Previous work showed the feasibility of virtual avatars but required 360-degree views of the targeted subject. To address this issue, we propose HINT, a NeRF-based algorithm able to learn a detailed and complete human model from limited viewing angles. We achieve this by introducing a symmetry prior, regularization constraints, and training cues from large human datasets. In particular, we introduce a sagittal-plane symmetry prior on the appearance of the human, directly supervise the density function of the human model using explicit 3D body modeling, and leverage a co-learned human digitization network as additional supervision for the unseen angles. As a result, our method can reconstruct complete humans even from a few viewing angles, increasing performance by more than 15% PSNR compared to previous state-of-the-art algorithms.
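As a rough illustration of the sagittal-plane symmetry prior mentioned above, the following sketch penalizes appearance differences between a canonical-space point and its mirror image. It is not the paper's loss: the choice of `x = 0` as the sagittal plane, the mean-squared-error form, and the names `mirror_sagittal`, `symmetry_loss`, and `color_fn` are all assumptions made for illustration.

```python
from typing import Callable, Sequence, Tuple

Point = Tuple[float, float, float]
Color = Tuple[float, float, float]


def mirror_sagittal(p: Point) -> Point:
    """Reflect a point across the plane x = 0 (assumed sagittal plane)."""
    x, y, z = p
    return (-x, y, z)


def symmetry_loss(color_fn: Callable[[Point], Color],
                  points: Sequence[Point]) -> float:
    """Mean squared colour difference between points and their mirrors.

    color_fn is a hypothetical stand-in for the appearance branch of the
    human model, evaluated in canonical space.
    """
    if not points:
        return 0.0
    total = 0.0
    for p in points:
        c = color_fn(p)
        c_mirror = color_fn(mirror_sagittal(p))
        # Average the squared error over the three colour channels.
        total += sum((a - b) ** 2 for a, b in zip(c, c_mirror)) / 3.0
    return total / len(points)


def symmetric_color(p: Point) -> Color:
    """Toy appearance that is symmetric about x = 0."""
    return (abs(p[0]), p[1], p[2])


def asymmetric_color(p: Point) -> Color:
    """Toy appearance that differs between the left and right sides."""
    return (p[0], p[1], p[2])


samples = [(1.0, 0.0, 0.0), (0.5, 0.2, -0.3)]
loss_symmetric = symmetry_loss(symmetric_color, samples)
loss_asymmetric = symmetry_loss(asymmetric_color, [(1.0, 0.0, 0.0)])
```

A symmetric appearance incurs no penalty, while an asymmetric one is pushed toward agreement between the seen and unseen sides, which is the intuition behind using such a prior when one side of the subject is never observed.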

Paper Structure (16 sections, 23 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Top row: a typical real-world scene in which a pedestrian passes a moving observing camera, offering only limited views for reconstruction. Second row: the reconstruction of the human. Our method is the only one able to reconstruct the human, despite one side being entirely unseen. Third row: a rendering of the human and the scene with a new trajectory toward the observing camera.
  • Figure 2: The proposed model architecture comprises a neural rendering approach that samples positions $\mathbf{x}$ along each camera ray $\mathbf{r}$. The positions are then split into the sets $X_h, X_{bkgr}$, according to whether they belong to the human ($X_h$) or the background ($X_{bkgr}$), and are modeled independently through two NeRFs $f_{bkgr}, f_{h}$. Modeling the human builds upon an SDF $s$, which requires the marching cubes algorithm for surface estimation and rendering. The background can be rendered with volume rendering. The representations are supervised with the losses $\mathcal{L}_{depth},\mathcal{L}_{mask},\mathcal{L}_{SDF},\mathcal{L}_{base},\mathcal{L}_{symm}, \mathcal{L}_{HDN}$ detailed in the corresponding loss subsections. Additionally, the auxiliary networks $g_v, g_c, f_M, h_v, h_c, f_D$ are shown predicting auxiliary training information such as masks and depth, as well as providing the foundational human shape knowledge for $\mathcal{L}_{HDN}$. The pre-trained weights of the Digital Human are refined through the loss $\mathcal{L}_{refine}$.
  • Figure 3: Qualitative comparison of HINT, NeuMan [jiang2022neuman] and Vid2Avatar [guo2023vid2avatar] for novel human pose renderings (left) and insertions into the scene background (right). Our proposed approach generates a consistent 3D representation of the human, while state-of-the-art methods are not able to handle unseen poses and viewing angles, leading to artifacts on the human's side and back marked with red boxes in the canonical representation.
  • Figure 4: Qualitative results of reconstructed images on the test set.
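
The ray-sampling step described in the Figure 2 caption, where positions along a camera ray are split into a human set $X_h$ and a background set $X_{bkgr}$, might be sketched as follows. The captions do not state how membership is decided; the axis-aligned bounding box used here, along with the names `sample_ray`, `inside_box`, and `split_samples`, is a hypothetical choice for illustration only.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, ...]


def sample_ray(origin: Sequence[float], direction: Sequence[float],
               near: float, far: float, n: int) -> List[Point]:
    """Return n points at the midpoints of equal intervals in [near, far]."""
    step = (far - near) / n
    depths = [near + (i + 0.5) * step for i in range(n)]
    return [tuple(o + t * d for o, d in zip(origin, direction))
            for t in depths]


def inside_box(p: Sequence[float], lo: Sequence[float],
               hi: Sequence[float]) -> bool:
    """True if p lies inside the axis-aligned box [lo, hi] (assumed test)."""
    return all(l <= c <= h for c, l, h in zip(p, lo, hi))


def split_samples(points: Sequence[Point], lo: Sequence[float],
                  hi: Sequence[float]) -> Tuple[List[Point], List[Point]]:
    """Split ray samples into a human set X_h and a background set X_bkgr."""
    x_h = [p for p in points if inside_box(p, lo, hi)]
    x_bkgr = [p for p in points if not inside_box(p, lo, hi)]
    return x_h, x_bkgr


# A ray along +z, with a hypothetical human bounding box spanning z in [1, 3].
points = sample_ray((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), near=0.0, far=4.0, n=4)
x_h, x_bkgr = split_samples(points, lo=(-1.0, -1.0, 1.0), hi=(1.0, 1.0, 3.0))
```

Each subset would then be passed to its own model ($f_h$ for $X_h$, $f_{bkgr}$ for $X_{bkgr}$), which is what allows the human and the background to be rendered and supervised independently.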