NCRF: Neural Contact Radiance Fields for Free-Viewpoint Rendering of Hand-Object Interaction

Zhongqun Zhang, Jifei Song, Eduardo Pérez-Pellitero, Yiren Zhou, Hyung Jin Chang, Aleš Leonardis

TL;DR

This paper tackles photo-realistic free-viewpoint rendering of hand-object interactions from sparse RGB videos. It introduces NCRF, a dynamic hand-object NeRF represented in a canonical space and driven by a hand-object motion field, complemented by a contact optimization field whose attention-based ContactNet refines hand and object poses through differentiable optimization. A mesh-guided ray sampling strategy and a joint training objective enable accurate pose refinement and high-fidelity rendering under heavy occlusions. Experimental results on HO3D and DexYCB demonstrate state-of-the-art rendering quality and improved hand-object pose accuracy, validating the effectiveness of jointly modeling rendering and contact priors for complex hand-object interactions.

Abstract

Modeling hand-object interactions is a fundamentally challenging task in 3D computer vision. Despite remarkable progress that has been achieved in this field, existing methods still fail to synthesize the hand-object interaction photo-realistically, suffering from degraded rendering quality caused by the heavy mutual occlusions between the hand and the object, and inaccurate hand-object pose estimation. To tackle these challenges, we present a novel free-viewpoint rendering framework, Neural Contact Radiance Field (NCRF), to reconstruct hand-object interactions from a sparse set of videos. In particular, the proposed NCRF framework consists of two key components: (a) A contact optimization field that predicts an accurate contact field from 3D query points for achieving desirable contact between the hand and the object. (b) A hand-object neural radiance field to learn an implicit hand-object representation in a static canonical space, in concert with the specifically designed hand-object motion field to produce observation-to-canonical correspondences. We jointly learn these key components where they mutually help and regularize each other with visual and geometric constraints, producing a high-quality hand-object reconstruction that achieves photo-realistic novel view synthesis. Extensive experiments on HO3D and DexYCB datasets show that our approach outperforms the current state-of-the-art in terms of both rendering quality and pose estimation accuracy.
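
The observation-to-canonical correspondence mentioned in (b) can be illustrated for the rigid object case. The following is a minimal sketch, assuming the object pose is a rotation `R` and translation `t` with `x_obs = R @ x_can + t`; the function names and the z-axis rotation helper are hypothetical and are not taken from the paper, and the hand's skeletal and non-rigid motion is not covered here.

```python
import math

def rotation_z(theta):
    """3x3 rotation about the z-axis as row-major nested lists (illustrative helper)."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]

def to_observation(point, R, t):
    """Apply the assumed rigid pose: x_obs = R @ x_can + t."""
    return [sum(R[i][k] * point[k] for k in range(3)) + t[i] for i in range(3)]

def to_canonical(point, R, t):
    """Invert the rigid pose: x_can = R^T @ (x_obs - t)."""
    d = [point[i] - t[i] for i in range(3)]
    # Row i of R^T is column i of R, hence the R[k][i] indexing.
    return [sum(R[k][i] * d[k] for k in range(3)) for i in range(3)]

# Round trip: a canonical point moved into observation space and back.
R = rotation_z(0.7)
t = [0.5, 0.0, -1.0]
x_can = [0.1, -0.2, 0.3]
x_back = to_canonical(to_observation(x_can, R, t), R, t)
```

Under this assumption, a more accurate pose gives more accurate canonical-space query points, which is one way to see why pose refinement and rendering quality are coupled.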

Paper Structure

This paper contains 12 sections, 14 equations, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: NCRF pipeline. We propose a novel hand-object neural radiance field to model the hand-object interaction. Our framework is composed of 1) a contact optimization field, which leverages a contact prior to refine the hand-object pose, 2) a hand deformation field, which deforms hand points from observation to canonical space, considering both skeletal and non-rigid motion, 3) an object deformation field, which transforms the object into canonical space using the refined rigid object pose, and 4) a canonical neural radiance field, which builds a canonical volume for the hand-object interaction and predicts color and density.
  • Figure 2: Structure of the contact optimization field. Given initial hand-object poses $\boldsymbol{P}_h^0$, $\boldsymbol{P}_{obj}^0$, we first refine the object pose by a residual $\Delta R, \Delta t$ learned by the Rigid Pose Correction module (not drawn). Our ContactNet consists of 1) a PointNet++ backbone to extract features for both hand and object, 2) the attention-based cross-feature augment to obtain hand-object interaction representations, and 3) the ConvNet to regress the contact field. The Pose Optimization consists of 1) the Differential Optimization (DiffOpt) module, which iteratively updates hand joint and rotation $(J, \Omega)$ conditioned on the contact field, and 2) the Joint Rotation Correction module, which learns a rotation residual $\Delta \Omega$ used to obtain the final refined hand pose.
  • Figure 3: Qualitative comparison with HumanNeRF (Weng et al., 2022) on the HO3D and DexYCB datasets.
  • Figure 4: Ablation study on the hand-object neural radiance field, covering the non-rigid motion (NR), mesh-guided sampling (MS), and contact optimization field (COF) modules. Zoom in for details.
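
The Figure 1 caption says the canonical radiance field predicts color and density for each query point. As a rough illustration of how such predictions are usually turned into a pixel color, here is a minimal sketch of the standard NeRF volume-rendering quadrature. It assumes NCRF follows that standard formulation, which this summary does not confirm, and the function name and arguments are hypothetical.

```python
import math

def composite_ray(densities, colors, deltas):
    """Composite per-sample (density, RGB) predictions along one ray.

    densities: non-negative sigma_i for each sample
    colors:    [r, g, b] for each sample
    deltas:    distance between consecutive samples
    Returns the pixel color and the per-sample weights.
    """
    transmittance = 1.0  # fraction of light that reaches the current sample
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for sigma, rgb, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        w = transmittance * alpha
        weights.append(w)
        for ch in range(3):
            pixel[ch] += w * rgb[ch]
        transmittance *= 1.0 - alpha  # light remaining after this segment
    return pixel, weights

# Three samples: empty space, then a nearly opaque red surface, then a green one behind it.
pixel, weights = composite_ray(
    densities=[0.0, 1e6, 5.0],
    colors=[[0.0, 0.0, 1.0], [1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
    deltas=[0.1, 0.1, 0.1],
)
```

In this toy ray the opaque red sample absorbs all remaining transmittance, so the green sample behind it contributes nothing. This is the occlusion behavior that makes heavy hand-object overlap hard to render from sparse views.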