NCRF: Neural Contact Radiance Fields for Free-Viewpoint Rendering of Hand-Object Interaction
Zhongqun Zhang, Jifei Song, Eduardo Pérez-Pellitero, Yiren Zhou, Hyung Jin Chang, Aleš Leonardis
TL;DR
This paper tackles photo-realistic free-viewpoint rendering of hand-object interactions from sparse RGB videos. It introduces NCRF, a dynamic hand-object NeRF represented in a canonical space and driven by a hand-object motion field, complemented by a contact optimization field whose attention-based ContactNet refines hand and object poses through differentiable optimization. A mesh-guided ray sampling strategy and a joint training objective enable accurate pose refinement and high-fidelity rendering under heavy occlusions. Experiments on HO3D and DexYCB demonstrate state-of-the-art rendering quality and improved hand-object pose accuracy, supporting the case for jointly modeling rendering and contact priors in complex hand-object interactions.
Abstract
Modeling hand-object interactions is a fundamentally challenging task in 3D computer vision. Despite remarkable progress in this field, existing methods still fail to synthesize hand-object interactions photo-realistically: rendering quality degrades because of heavy mutual occlusions between the hand and the object and because of inaccurate hand-object pose estimation. To tackle these challenges, we present a novel free-viewpoint rendering framework, Neural Contact Radiance Field (NCRF), to reconstruct hand-object interactions from a sparse set of videos. The proposed NCRF framework consists of two key components: (a) a contact optimization field that predicts an accurate contact field from 3D query points to achieve desirable contact between the hand and the object; and (b) a hand-object neural radiance field that learns an implicit hand-object representation in a static canonical space, in concert with a specifically designed hand-object motion field that produces observation-to-canonical correspondences. We learn these components jointly, so that they help and regularize each other through visual and geometric constraints, producing a high-quality hand-object reconstruction that achieves photo-realistic novel view synthesis. Extensive experiments on the HO3D and DexYCB datasets show that our approach outperforms the current state of the art in both rendering quality and pose estimation accuracy.
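To make the canonical-space idea concrete, the following is a minimal sketch and not the NCRF implementation. It assumes a single rigid pose (rotation R, translation t) as a simplified stand-in for the hand-object motion field, and it uses the standard NeRF volume-rendering quadrature to composite samples along one ray. The function names `to_canonical` and `composite` and all numeric values are hypothetical, chosen only for illustration.

```python
import math


def to_canonical(point_obs, rotation, translation):
    """Map one observation-space point to canonical space.

    Assumes a rigid pose x_obs = R @ x_can + t, so x_can = R^T @ (x_obs - t).
    This is a simplification; a learned motion field may be non-rigid.
    """
    diff = [p - t for p, t in zip(point_obs, translation)]
    # Component c of R^T @ diff is the sum over rows r of R[r][c] * diff[r].
    return [sum(rotation[r][c] * diff[r] for r in range(3)) for c in range(3)]


def composite(sigmas, colors, deltas):
    """Standard NeRF quadrature along one ray.

    alpha_i = 1 - exp(-sigma_i * delta_i)
    w_i     = T_i * alpha_i, with T_i = prod_{j<i} (1 - alpha_j)
    Returns the composited RGB value and the per-sample weights.
    """
    transmittance = 1.0
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for sigma, color, delta in zip(sigmas, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        weights.append(weight)
        pixel = [p + weight * c for p, c in zip(pixel, color)]
        transmittance *= 1.0 - alpha
    return pixel, weights


if __name__ == "__main__":
    # Hypothetical pose: 90-degree rotation about the z-axis plus a translation.
    R = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
    t = [1.0, 2.0, 3.0]
    print(to_canonical([1.0, 3.0, 3.0], R, t))

    # Two samples on a ray: empty space followed by a dense green surface.
    rgb, w = composite([0.0, 50.0], [[1, 0, 0], [0, 1, 0]], [0.1, 0.1])
    print(rgb, w)
```

In a canonical-space formulation of this kind, each sample on a camera ray would first be mapped by something like `to_canonical`, the radiance field would then be queried at the canonical location, and the resulting densities and colors would be accumulated as in `composite`.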
