GHNeRF: Learning Generalizable Human Features with Efficient Neural Radiance Fields

Arnab Dey; Di Yang; Rohith Agaram; Antitza Dantcheva; Andrew I. Comport; Srinath Sridhar; Jean Martinet

TL;DR

GHNeRF is a generalizable NeRF framework that jointly learns neural radiance fields and human biomechanic features (2D/3D joint locations, dense pose) from sparse 2D images by fusing pixel-aligned image features with a dedicated human feature stream. The method extends NeRF with a heatmap branch that predicts joints, enabling accurate 2D/3D keypoint estimation while maintaining high-quality novel-view synthesis. Evaluations on ZJU_MoCap and RenderPeople demonstrate state-of-the-art joint estimation performance and competitive rendering quality; the method also supports DensePose estimation, and ablations confirm the benefits of a DINO-based human encoder and heatmap-based supervision. GHNeRF runs in near real time and offers a pathway to learning additional biomechanic properties beyond joints, which could improve AR/VR avatars and animation pipelines with minimal per-scene supervision.

Abstract

Recent advances in Neural Radiance Fields (NeRF) have demonstrated promising results in 3D scene representations, including 3D human representations. However, these representations often lack information on the underlying human pose and structure, which is crucial for AR/VR applications and games. In this paper, we introduce a novel approach, termed GHNeRF, designed to address these limitations by learning 2D/3D joint locations of human subjects with a NeRF representation. GHNeRF uses a pre-trained 2D encoder streamlined to extract essential human features from 2D images, which are then incorporated into the NeRF framework in order to encode human biomechanic features. This allows our network to simultaneously learn biomechanic features, such as joint locations, along with human geometry and texture. To assess the effectiveness of our method, we conduct a comprehensive comparison with state-of-the-art human NeRF techniques and joint estimation algorithms. Our results show that GHNeRF can achieve state-of-the-art results in near real-time.
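As a rough illustration of how joint locations can be read off predicted heatmaps, the sketch below takes the peak of each per-joint heatmap as the 2D keypoint. The peak-picking rule, the function name `heatmaps_to_keypoints`, and the `(J, H, W)` array layout are assumptions made for illustration; the paper's own keypoint extraction step is not reproduced in this summary and may differ.

```python
import numpy as np

def heatmaps_to_keypoints(heatmaps):
    """Pick the peak of each joint heatmap (assumed rule, not the paper's).

    heatmaps: array of shape (J, H, W), one heatmap per joint.
    Returns (J, 2) integer (x, y) pixel coordinates and (J,) peak values.
    """
    num_joints, height, width = heatmaps.shape
    flat = heatmaps.reshape(num_joints, -1)
    # Index of the maximum response for every joint.
    peak_index = flat.argmax(axis=1)
    ys, xs = np.unravel_index(peak_index, (height, width))
    keypoints = np.stack([xs, ys], axis=1)
    # The peak value can serve as a simple confidence score.
    confidence = flat.max(axis=1)
    return keypoints, confidence
```

The peak value is returned alongside each location so that low-response joints (for example, occluded ones) can be filtered by a threshold if needed.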

Paper Structure (28 sections, 6 equations, 16 figures, 7 tables)

Introduction
Related works
NeRF for 3D representation
NeRF for human representation
Human pose estimation
Method
Preliminaries
Feature extraction
Learning human features with NeRF
Keypoints extraction
Experimental Results
Experimental Setting
Performance on novel view synthesis and joint estimation
Performance on dense human pose estimation
Rendering speed
...and 13 more sections

Figures (16)

Figure 1: In this work we propose GHNeRF, which simultaneously learns both neural radiance fields and human features from sparse images. (a) shows high-quality novel-view renderings. (b) shows generalizable human features (keypoints, dense pose, etc.) estimated by GHNeRF. (c) presents an interactive tool to render free-viewpoint videos of novel views and human features.
Figure 2: Overview of the GHNeRF pipeline: Given an input image $I$, human features $f_\textbf{h}$ and multi-resolution image features $f_{img}$ are extracted using a 2D image encoder and a 2D CNN, respectively. Subsequently, $f_{img}$ is used to form a cost volume for depth prediction. The predicted depth is used for depth-guided sampling to reduce the number of samples along the ray. For each 3D sample point $x$ along the ray, we combine image and voxel features as input to an MLP $g_{NeRF}$, generating the intermediate NeRF feature $V_{NeRF}$. Finally, the intermediate NeRF feature $V_{NeRF}$ and the human feature $f_\textbf{h}$ are concatenated and fed into a smaller MLP $g_h$ to produce heatmaps. Furthermore, $V_{NeRF}$ and the view direction $\textbf{d}$ are combined in another MLP $g_c$ to derive color $c$. The final pixel color and heatmaps are generated using volume rendering.
Figure 3: Qualitative comparison of generalization results on an unseen ZJU_MoCap test sequence.
Figure 4: Qualitative results of keypoint estimation on the ZJU_MoCap dataset.
Figure 5: Qualitative results of keypoint estimation on the RenderPeople dataset.
...and 11 more figures
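
The Figure 2 caption states that both the final pixel color and the heatmaps are produced by volume rendering. A minimal sketch of that compositing step for a single ray is given below, using the standard NeRF quadrature weights. It assumes the heatmap channels are accumulated with the same weights as color; the function name `render_ray` and the array shapes are hypothetical, and the depth-guided sampling and MLPs ($g_{NeRF}$, $g_h$, $g_c$) that would produce the inputs are not modeled here.

```python
import numpy as np

def render_ray(sigma, color, heat, t):
    """Composite per-sample colors and joint heatmap values along one ray.

    sigma: (N,) densities, color: (N, 3), heat: (N, J), t: (N,) sample depths.
    Assumption: heatmaps share the compositing weights used for color.
    """
    # Distance between adjacent samples; the last interval is treated as very large.
    delta = np.append(np.diff(t), 1e10)
    # Opacity contributed by each sample.
    alpha = 1.0 - np.exp(-sigma * delta)
    # Transmittance: fraction of light reaching sample i without being absorbed.
    trans = np.cumprod(np.append(1.0, 1.0 - alpha[:-1]))
    weights = trans * alpha
    pixel_color = weights @ color  # (3,)
    pixel_heat = weights @ heat    # (J,)
    return pixel_color, pixel_heat, weights
```

Because the weights depend only on density, samples behind an opaque surface contribute nothing to either output, so the rendered heatmap value at a pixel comes from the same surface point as its color.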
