TexHOI: Reconstructing Textures of 3D Unknown Objects in Monocular Hand-Object Interaction Scenes

Alakh Aggarwal; Ningna Wang; Xiaohu Guo

TL;DR

TexHOI tackles monocular texture reconstruction in dynamic hand-object scenes with a two-stage approach that separates pose refinement from the recovery of texture and illumination. Stage 1 uses compositional NeRFs to refine hand and object poses and to learn low-fidelity geometry. Stage 2 applies physics-based rendering with Spherical Gaussian (SG) approximations to recover albedo and lighting, explicitly modeling hand occlusion with 108 parameterized spheres. The method disentangles intrinsic object texture from hand shadows and environmental illumination, outperforms state-of-the-art texture reconstruction baselines, and predicts albedo that is robust to lighting changes. This yields more realistic renderings across varying viewpoints and lighting, with potential benefits for AR/VR realism and robotic perception.
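To make the SG-based lighting in Stage 2 more concrete, here is a minimal sketch of evaluating environment light as a sum of spherical Gaussian lobes. It assumes the commonly used lobe form, amplitude * exp(sharpness * (d . axis - 1)); the summary does not state the exact parameterization TexHOI uses, and the function names `sg_eval` and `env_light` and all lobe values are hypothetical.

```python
import math


def normalize(v):
    """Return v scaled to unit length."""
    n = math.sqrt(sum(c * c for c in v))
    return tuple(c / n for c in v)


def sg_eval(direction, axis, sharpness, amplitude):
    """Evaluate one SG lobe: amplitude * exp(sharpness * (d . axis - 1)).

    Assumed (standard) lobe form; not confirmed by the source.
    """
    d = normalize(direction)
    a = normalize(axis)
    cos_angle = sum(x * y for x, y in zip(d, a))
    falloff = math.exp(sharpness * (cos_angle - 1.0))
    return tuple(c * falloff for c in amplitude)


def env_light(direction, lobes):
    """Sum the RGB contributions of all lobes for one query direction."""
    total = [0.0, 0.0, 0.0]
    for axis, sharpness, amplitude in lobes:
        rgb = sg_eval(direction, axis, sharpness, amplitude)
        total = [t + c for t, c in zip(total, rgb)]
    return tuple(total)


# Illustrative lobes only: (axis, sharpness, RGB amplitude).
lobes = [
    ((0.0, 0.0, 1.0), 10.0, (1.0, 0.9, 0.8)),  # overhead light
    ((1.0, 0.0, 0.0), 2.0, (0.2, 0.2, 0.3)),   # broad side fill
]
radiance = env_light((0.0, 0.0, 1.0), lobes)
```

A lobe is brightest along its axis and falls off smoothly with angle, with higher sharpness giving a tighter highlight. That smooth, closed form is why a small set of lobes can stand in for an environment map during optimization.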

Abstract

Reconstructing 3D models of dynamic, real-world objects with high-fidelity textures from monocular frame sequences has been a challenging problem in recent years. This difficulty stems from factors such as shadows, indirect illumination, and inaccurate object-pose estimations due to occluding hand-object interactions. To address these challenges, we propose a novel approach that predicts the hand's impact on environmental visibility and indirect illumination on the object's surface albedo. Our method first learns the geometry and low-fidelity texture of the object, hand, and background through composite rendering of radiance fields. Simultaneously, we optimize the hand and object poses to achieve accurate object-pose estimations. We then refine physics-based rendering parameters (roughness, specularity, albedo, hand visibility, skin color reflections, and environmental illumination) to produce precise albedo and accurate hand illumination and shadow regions. Our approach surpasses state-of-the-art methods in texture reconstruction and, to the best of our knowledge, is the first to account for hand-object interactions in object texture reconstruction.

Paper Structure (29 sections, 29 equations, 10 figures, 2 tables)

Introduction
Related Works
Traditional Inverse Rendering for 3D Object Reconstruction
Neural Representation in Inverse Rendering
Physics-based Inverse Rendering
Handling Dynamic Elements in Multi-Object Interaction
Preliminaries
Volumetric Rendering
Surface Rendering
Methodology
Stage 1: Compositional Neural Radiance Field for Pose Refinement
Object NeRF
Hand NeRF
Compositional NeRF
Pose Refinement
...and 14 more sections

Figures (10)

Figure 1: The overall pipeline of our proposed TexHOI method. In the first stage (Sec. \ref{sec:stage1}), the hand and object poses are fine-tuned along with composite radiance fields of hand, object, and background. Using the predicted object segmentation and object geometry mask, in the second stage (Sec. \ref{sec:stage2}), the optimized hand and object poses are used to accurately learn material properties on the object surface, i.e. albedo, BRDF, and hand occlusions, using physics-based rendering with Spherical Gaussian approximations.
Figure 2: After Stage 1, for each (a) input image, predicted (b) object mask $M_{obj}$ and (c) hand-object mask $M_{ho}$ are calculated.
Figure 3: Canonical MANO hand is packed with $108$ parameterizable spheres for hand-occlusion computation.
Figure 4: Physics-based rendering computes the final color of a surface point by integrating over the hemisphere centered on the point's surface normal. Under the SG approximation, direct illumination is defined as a sum of $128$ SGs; hand occlusion is computed from the parameterizable sphere representation of the MANO hand; indirect illumination from the occluding hand is a learned parameter; albedo represents the base color of the object without hand shadows or environment reflections; and the specularity of the object is computed from its material properties (roughness, specular reflectance, etc.).
Figure 5: An SG hemispherical lobe, centered around the Z-axis, is divided into 64x64 patches. An occluding sphere projected onto the SG lobe covers some of these patches. The occluded patches are identified, and the fraction of patches they make up gives the hand occlusion.
...and 5 more figures
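The patch-based occlusion idea described for Figures 3 and 5 can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the surface point sits at the origin, the lobe axis is +Z, patches come from a uniform grid in polar and azimuthal angle, each patch is tested at its center direction, and every patch counts equally. The function names and the sphere values are hypothetical.

```python
import math


def patch_directions(n=64):
    """Center directions of an n x n grid over the +Z hemisphere.

    Assumption: uniform grid in polar angle theta and azimuth phi.
    """
    dirs = []
    for i in range(n):
        theta = (i + 0.5) * (math.pi / 2) / n  # angle from +Z
        for j in range(n):
            phi = (j + 0.5) * (2 * math.pi) / n
            dirs.append((
                math.sin(theta) * math.cos(phi),
                math.sin(theta) * math.sin(phi),
                math.cos(theta),
            ))
    return dirs


def ray_hits_sphere(d, center, radius):
    """True if the ray from the origin along unit vector d hits the sphere."""
    cc = sum(c * c for c in center)
    if cc <= radius * radius:
        return True  # origin lies inside the sphere
    t = sum(x * y for x, y in zip(d, center))  # projection of center on ray
    if t <= 0:
        return False  # sphere is behind the ray
    # Squared distance from the sphere center to the ray, against radius^2.
    return cc - t * t <= radius * radius


def occluded_fraction(spheres, n=64):
    """Fraction of hemisphere patches blocked by any (center, radius) sphere."""
    dirs = patch_directions(n)
    blocked = sum(
        1 for d in dirs
        if any(ray_hits_sphere(d, c, r) for c, r in spheres)
    )
    return blocked / len(dirs)


# Hypothetical occluder: one sphere directly above the surface point.
fraction = occluded_fraction([((0.0, 0.0, 2.0), 1.0)])
```

With a hand represented as a set of spheres, the same call takes the full list of (center, radius) pairs. Counting patches equally is a simplification; a solid-angle or SG-weighted sum would be a natural refinement.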
