TexHOI: Reconstructing Textures of 3D Unknown Objects in Monocular Hand-Object Interaction Scenes

Alakh Aggarwal, Ningna Wang, Xiaohu Guo

TL;DR

TexHOI tackles monocular texture reconstruction in dynamic hand-object scenes with a two-stage approach that separates pose refinement from texture and illumination estimation. Stage 1 uses compositional NeRFs to refine hand and object poses and to learn low-fidelity geometry. Stage 2 applies physics-based rendering with Spherical Gaussian (SG) approximations to recover albedo and lighting, explicitly modeling hand occlusion with 108 parameterized spheres. The method disentangles intrinsic object texture from hand shadows and environmental illumination, outperforming state-of-the-art texture reconstruction baselines and enabling lighting-robust albedo prediction. This yields more realistic renderings across varying viewpoints and lighting, with potential benefits for AR/VR realism and robotic perception.

Abstract

Reconstructing 3D models of dynamic, real-world objects with high-fidelity textures from monocular frame sequences has been a challenging problem in recent years. This difficulty stems from factors such as shadows, indirect illumination, and inaccurate object-pose estimations due to occluding hand-object interactions. To address these challenges, we propose a novel approach that predicts the hand's impact on environmental visibility and indirect illumination on the object's surface albedo. Our method first learns the geometry and low-fidelity texture of the object, hand, and background through composite rendering of radiance fields. Simultaneously, we optimize the hand and object poses to achieve accurate object-pose estimations. We then refine physics-based rendering parameters - including roughness, specularity, albedo, hand visibility, skin color reflections, and environmental illumination - to produce precise albedo, and accurate hand illumination and shadow regions. Our approach surpasses state-of-the-art methods in texture reconstruction and, to the best of our knowledge, is the first to account for hand-object interactions in object texture reconstruction.

Paper Structure

This paper contains 29 sections, 29 equations, 10 figures, 2 tables.

Figures (10)

  • Figure 1: The overall pipeline of our proposed TexHOI method. In the first stage (Sec. \ref{sec:stage1}), the hand and object poses are fine-tuned along with composite radiance fields of hand, object, and background. Using the predicted object segmentation and object geometry mask, in the second stage (Sec. \ref{sec:stage2}), the optimized hand and object poses are used to accurately learn material properties on the object surface, i.e., albedo, BRDF, and hand occlusions, using physics-based rendering with Spherical Gaussian approximations.
  • Figure 2: After Stage 1, for each (a) input image, predicted (b) object mask $M_{obj}$ and (c) hand-object mask $M_{ho}$ are calculated.
  • Figure 3: Canonical MANO hand is packed with $108$ parameterizable spheres for hand-occlusion computation.
  • Figure 4: Physics-based rendering computes the final color of a surface point as an integral over the hemisphere centered on the point's surface normal. Under the SG approximation, direct illumination is defined as a sum of $128$ SGs; hand occlusion is calculated from the parameterizable spherical representation of the MANO hand; indirect illumination from the occluding hand is a learned parameter; albedo represents the base color of the object without hand shadows or environment reflections; and the specularity of the object is calculated from its material properties (roughness, specular reflectance, etc.).
  • Figure 5: An SG hemispherical lobe, centered around the Z-axis, is divided into $64 \times 64$ patches. An occluding sphere projected onto the SG lobe covers some of these patches. The occluded patches are identified, and the fraction of the lobe they cover gives the hand occlusion.
  • ...and 5 more figures
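
The patch-based occlusion test described in the Figure 5 caption can be illustrated with a short sketch. This is not the authors' implementation: the function names, the equal-angle (theta, phi) patch grid, the placement of the surface point at the origin, and the use of an unweighted patch count (rather than an SG-weighted or solid-angle-weighted one) are all assumptions made for illustration. A patch counts as occluded when the ray from the surface point through the patch centre hits the sphere.

```python
import numpy as np


def hemisphere_patch_dirs(n=64):
    """Unit directions at the centres of an n x n (theta, phi) grid on the +Z hemisphere.

    The equal-angle grid is an assumption; the caption only says "64x64 patches".
    """
    theta = (np.arange(n) + 0.5) * (np.pi / 2) / n  # polar angle measured from +Z
    phi = (np.arange(n) + 0.5) * (2 * np.pi) / n    # azimuth around Z
    t, p = np.meshgrid(theta, phi, indexing="ij")
    return np.stack(
        [np.sin(t) * np.cos(p), np.sin(t) * np.sin(p), np.cos(t)], axis=-1
    )


def occluded_mask(dirs, center, radius):
    """Boolean mask: True where a ray from the origin along a patch direction hits the sphere."""
    center = np.asarray(center, dtype=float)
    dist = np.linalg.norm(center)
    if dist <= radius:
        # The surface point lies inside the sphere, so every direction is blocked.
        return np.ones(dirs.shape[:-1], dtype=bool)
    # Seen from the origin, the sphere subtends a cone of half-angle asin(radius / dist).
    cos_half_angle = np.sqrt(1.0 - (radius / dist) ** 2)
    return dirs @ (center / dist) >= cos_half_angle


def occlusion_fraction(center, radius, n=64):
    """Fraction of hemisphere patches covered by one sphere (unweighted patch count)."""
    mask = occluded_mask(hemisphere_patch_dirs(n), center, radius)
    return float(mask.mean())


if __name__ == "__main__":
    # Hypothetical sphere of radius 1 placed two units along the lobe axis.
    print(occlusion_fraction(center=(0.0, 0.0, 2.0), radius=1.0))
```

For a hand represented by many spheres, the per-sphere masks could be combined with a logical OR before taking the fraction, so that overlapping spheres are not counted twice. Whether the paper combines them this way is not stated in the captions above.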