3D Reconstruction of Objects in Hands without Real World 3D Supervision

Aditya Prakash, Matthew Chang, Matthew Jin, Ruisen Tu, Saurabh Gupta

TL;DR

This work tackles 3D reconstruction of hand-held objects from a single image without real-world 3D supervision by combining two complementary sources: 2D object masks derived from in-the-wild videos and shape priors learned from synthetic 3D shape collections. It introduces an occupancy-network framework (HORSE) trained with 2D mask-guided 3D sampling and a novel 2D slice-based discriminator that enforces plausible shapes, which helps the model generalize to novel objects. The shape priors are learned from ObMan and the 2D supervision comes from VISOR; the work also constructs a new Wild Objects in Hands dataset to support in-the-wild learning. In the object generalization setting on MOW, HORSE achieves an 11.6% relative improvement over 3D-supervised baselines, highlighting the value of indirect 3D cues for scalable, real-world 3D reconstruction of hand-held objects.

Abstract

Prior works for reconstructing hand-held objects from a single image train models on images paired with 3D shapes. Such data is challenging to gather in the real world at scale. Consequently, these approaches do not generalize well when presented with novel objects in in-the-wild settings. While 3D supervision is a major bottleneck, there is an abundance of a) in-the-wild raw video data showing hand-object interactions and b) synthetic 3D shape collections. In this paper, we propose modules to leverage 3D supervision from these sources to scale up the learning of models for reconstructing hand-held objects. Specifically, we extract multiview 2D mask supervision from videos and 3D shape priors from shape collections. We use these indirect 3D cues to train occupancy networks that predict the 3D shape of objects from a single RGB image. Our experiments in the challenging object generalization setting on the in-the-wild MOW dataset show an 11.6% relative improvement over models trained with 3D supervision on existing datasets.

Paper Structure (15 sections, 5 equations, 8 figures, 9 tables)

Figures (8)

  • Figure 1: We propose modules to extract supervision from in-the-wild videos (Sec. \ref{sec:dummy-segm-sup}) & learn shape priors from 3D object collections (Sec. \ref{sec:dummy-contact-sup}), to train occupancy networks which predict the 3D shapes of hand-held objects from a single image. This circumvents the need for paired real world 3D shape supervision used in existing works [ye2022ihoi, hasson19_obman].
  • Figure 2: Registering objects via hand pose and 2D Mask guided 3D sampling. (a) Consider unposed frames from in-the-wild videos. (b) We use hand pose from FrankMocap [rong2020frankmocap] as a proxy for object pose, thereby registering the different views. (c) We then use 2D object masks for labeling 3D points with occupancy (Sec. \ref{sec:dummy-segm-sup}). 3D points that project into the object mask in all views are considered as occupied (green triangles), all other points are considered unoccupied (red crosses). (3D object in the figure is for visualization only, not used for sampling.)
  • Figure 2: HO3D Object generalization. We outperform AC-OCC & AC-SDF trained on different datasets with 3D supervision.
  • Figure 3: 2D slice based 3D discriminator. We learn data-driven 3D shape priors using hand-held objects from ObMan dataset. We sample planes through the object (shown above in blue), resulting in a 2D cross-section map. We pass occupancy predictions on points from these cross-sections through a discriminator which tries to distinguish cross-sections of predicted 3D shapes from cross-sections of ObMan objects (Sec. \ref{sec:dummy-contact-sup}).
  • Figure 4: VISOR visualizations. Using existing hand pose estimation techniques [rong2020frankmocap], we are able to track the objects in relation to hands through time in in-the-wild videos. We visualize these tracks along with object masks from the VISOR dataset [darkhalil2022visor]. This form of data, where objects move rigidly relative to hands, is used to train our model to learn 3D shape of hand-held objects.
  • ...and 3 more figures
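
The labeling rule in the caption on 2D mask guided 3D sampling (a 3D point counts as occupied only if it projects inside the object mask in every registered view) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function name `label_points_by_masks`, the pinhole intrinsics `K`, the per-view 4x4 transforms, and the choice to treat points that fall outside the image or behind the camera as unoccupied are all assumptions made for illustration.

```python
import numpy as np

def label_points_by_masks(points, masks, to_cam, K):
    """Label 3D points as occupied iff they project into the mask in ALL views.

    points: (N, 3) points in the shared (hand-registered) frame.
    masks:  list of (H, W) boolean object masks, one per view.
    to_cam: list of (4, 4) transforms from the shared frame to each camera.
    K:      (3, 3) pinhole intrinsics (assumed identical for every view).
    """
    n = len(points)
    homog = np.concatenate([points, np.ones((n, 1))], axis=1)  # (N, 4)
    occupied = np.ones(n, dtype=bool)
    for mask, T in zip(masks, to_cam):
        cam = (T @ homog.T).T[:, :3]            # points in this camera's frame
        in_front = cam[:, 2] > 1e-6
        z = np.where(in_front, cam[:, 2], 1.0)  # placeholder depth avoids /0
        uv = (K @ cam.T).T
        u = np.round(uv[:, 0] / z).astype(int)  # pixel column
        v = np.round(uv[:, 1] / z).astype(int)  # pixel row
        h, w = mask.shape
        visible = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
        hit = np.zeros(n, dtype=bool)
        hit[visible] = mask[v[visible], u[visible]]
        occupied &= hit                         # must be inside every mask
    return occupied

# Hypothetical two-view setup: a square mask seen from the front and the side.
K = np.array([[100.0, 0.0, 32.0], [0.0, 100.0, 32.0], [0.0, 0.0, 1.0]])
mask = np.zeros((64, 64), dtype=bool)
mask[22:43, 22:43] = True

front = np.eye(4)
front[2, 3] = 2.0                               # camera 2 units away along z
side = np.eye(4)
side[:3, :3] = [[0, 0, 1], [0, 1, 0], [-1, 0, 0]]  # 90 degree turn about y
side[2, 3] = 2.0

points = np.array([
    [0.0, 0.0, 0.0],   # inside both masks
    [0.5, 0.0, 0.0],   # outside the front-view mask
    [0.0, 0.0, 0.1],   # inside both masks
    [0.0, 0.0, 0.5],   # inside the front mask, carved away by the side view
])
labels = label_points_by_masks(points, [mask, mask], [front, side], K)
print(labels)
```

The last point shows why multiple views matter: a single mask cannot constrain depth along the viewing ray, while the second registered view removes points that lie outside the object's silhouette from another direction.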
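
The cross-section sampling described in the caption on the 2D slice based 3D discriminator can be sketched in the same spirit. The code below only builds the 2D cross-section map that would be fed to a discriminator; the discriminator itself is omitted. The helper names, the choice of a randomly oriented plane through the origin, the grid resolution, and the sphere used as a stand-in occupancy function are assumptions, not details taken from the paper.

```python
import numpy as np

def sample_slice_points(rng, resolution=32, half_extent=1.0):
    """Return a (resolution, resolution, 3) grid of points on a random plane
    through the origin, together with the plane's unit normal."""
    normal = rng.normal(size=3)
    normal /= np.linalg.norm(normal)
    # Pick a helper axis that is not parallel to the normal, then build
    # two orthonormal in-plane directions with cross products.
    helper = np.array([1.0, 0.0, 0.0]) if abs(normal[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    axis_u = np.cross(normal, helper)
    axis_u /= np.linalg.norm(axis_u)
    axis_v = np.cross(normal, axis_u)
    ticks = np.linspace(-half_extent, half_extent, resolution)
    grid_u, grid_v = np.meshgrid(ticks, ticks, indexing="xy")
    points = grid_u[..., None] * axis_u + grid_v[..., None] * axis_v
    return points, normal

def occupancy_slice(occupancy_fn, points):
    """Evaluate an occupancy function on the plane grid to get a 2D map."""
    flat = points.reshape(-1, 3)
    return occupancy_fn(flat).reshape(points.shape[:2])

# Stand-in for a predicted shape: a sphere of radius 0.5 centred at the origin.
def sphere_occupancy(xyz):
    return np.linalg.norm(xyz, axis=1) < 0.5

rng = np.random.default_rng(0)
slice_points, normal = sample_slice_points(rng)
slice_map = occupancy_slice(sphere_occupancy, slice_points)
print(slice_map.shape, int(slice_map.sum()))
```

In a training loop, maps like `slice_map` computed from predicted shapes and from reference shapes would be the two classes a discriminator tries to tell apart; working on 2D slices keeps that discriminator a plain 2D network.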