HandNeRF: Learning to Reconstruct Hand-Object Interaction Scene from a Single RGB Image

Hongsuk Choi; Nikhil Chavan-Dafle; Jiacheng Yuan; Volkan Isler; Hyunsoo Park

TL;DR

HandNeRF tackles single-image 3D hand–object scene reconstruction by learning a semantic neural radiance field conditioned on a 3D hand shape and 2D object features. A core novelty is explicit hand–object interaction encoding via a 3D CNN that fuses hand and object features into an interaction volume, enabling accurate object geometry reconstruction without relying on 3D object templates. The method achieves state-of-the-art or comparable results on DexYCB and HO-3D v3, generalizes well to novel grasps and unseen objects, and improves downstream tasks such as grasp planning and motion planning. This work demonstrates that incorporating explicit hand geometry priors into implicit representations can robustly regularize plausible hand–object reconstructions from sparse data, with practical impact for robotics and AR/VR applications.

Abstract

This paper presents a method to learn a hand-object interaction prior for reconstructing a 3D hand-object scene from a single RGB image. Both inference and training-data generation for 3D hand-object scene reconstruction are challenging due to the depth ambiguity of a single image and occlusions by the hand and object. We turn this challenge into an opportunity by utilizing the hand shape to constrain the possible relative configuration of the hand and object geometry. We design a generalizable implicit function, HandNeRF, that explicitly encodes the correlation of the 3D hand shape features and 2D object features to predict the hand and object scene geometry. With experiments on real-world datasets, we show that HandNeRF is able to reconstruct hand-object scenes of novel grasp configurations more accurately than comparable methods. Moreover, we demonstrate that object reconstruction from HandNeRF ensures more accurate execution of downstream tasks, such as grasping and motion planning for robotic hand-over and manipulation. Homepage: https://samsunglabs.github.io/HandNeRF-project-page/
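
To make the idea of a semantic radiance field more concrete, the sketch below composites per-sample density, color, and semantic predictions along a single camera ray using the standard NeRF volume-rendering quadrature. It is an illustration under stated assumptions, not the authors' implementation: the function name `render_ray`, the three-class label set (background, hand, object), and the sample values are all hypothetical, and the network that would produce the per-sample predictions is omitted.

```python
import math

def render_ray(densities, colors, semantic_probs, deltas):
    """Composite per-sample predictions along one ray.

    Standard NeRF quadrature: alpha_i = 1 - exp(-sigma_i * delta_i),
    weight_i = T_i * alpha_i, where T_i is the transmittance accumulated
    before sample i. The same weights blend colors and semantic
    probabilities. All names here are illustrative assumptions.
    """
    transmittance = 1.0
    rgb = [0.0, 0.0, 0.0]
    semantics = [0.0] * len(semantic_probs[0])
    weights = []
    for sigma, color, probs, delta in zip(densities, colors, semantic_probs, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)   # opacity of this ray segment
        weight = transmittance * alpha           # contribution of this sample
        rgb = [acc + weight * c for acc, c in zip(rgb, color)]
        semantics = [acc + weight * p for acc, p in zip(semantics, probs)]
        weights.append(weight)
        transmittance *= 1.0 - alpha             # light remaining after this sample
    return rgb, semantics, weights

# Hypothetical samples along one ray: empty space, then an opaque "hand" point,
# then an "object" point hidden behind it.
densities = [0.0, 1e9, 5.0]
colors = [[0.0, 0.0, 0.0], [0.8, 0.6, 0.5], [0.1, 0.2, 0.9]]
# Assumed class order: [background, hand, object]
semantic_probs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
deltas = [0.1, 0.1, 0.1]

rgb, semantics, weights = render_ray(densities, colors, semantic_probs, deltas)
```

In this toy ray the opaque hand sample absorbs all remaining transmittance, so the occluded object sample contributes nothing to the rendered pixel. The same weights could also be used to render depth by blending sample distances.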

Paper Structure (15 sections, 6 equations, 12 figures, 8 tables)


Introduction
Related Work
Method
Modeling Hand-object Interaction
Implementation of HandNeRF
Experiments
Ablation Study
Comparison with State-of-the-art Methods
Limitation and Future work
Conclusion
Additional Qualitative Results
Generalization Results per Object
Comparison with IHOI
Sensitivity Study to External Input
Implementation Detail

Figures (12)

Figure 1: Given a single RGB image of a hand-object interaction scene, HandNeRF predicts the hand and object's density, color, and semantics, which can be converted to reconstruction of 3D hand and object meshes and rendered to novel view images (RGB, depth, and semantic segmentation). HandNeRF learns the correlation between hand and object geometry from different types of hand-object interactions, supervised by sparse view images. HandNeRF is tested on a novel scene with an unseen hand-object interaction.
Figure 2: HandNeRF takes a single RGB image and predicts the volume density, color radiance, and semantic label of each query point in a neural field. Unlike the comparable works of Ye et al. ye2022s and Choi et al. choi2022mononhr, which learn the interaction between hand and object implicitly, HandNeRF explicitly encodes the correlation between hand and object features in 3D space, which provides more accurate 3D reconstruction and novel view synthesis.
Figure 3: We visualize object reconstruction with the hand estimation from HandOccNet park2022handoccnet. Using explicit hand-object interaction features, HandNeRF generates more accurate reconstruction.
Figure 4: Qualitative results of novel view synthesis (image, depth, and semantic segmentation) and 3D mesh on DexYCB and HO-3D v3. Ground truth hand meshes are used as input.
Figure 5: Qualitative results of novel view synthesis (image, depth, and semantic segmentation) and 3D mesh on DexYCB chao2021dexycb and HO-3D v3 hampali2020honnotate, given the hand mesh estimated by HandOccNet park2022handoccnet. The bottom results for the scissors use the ground truth hand mesh for reference.
...and 7 more figures
