PickScan: Object discovery and reconstruction from handheld interactions

Vincent van der Brugge, Marc Pollefeys, Joshua B. Tenenbaum, Ayush Tewari, Krishna Murthy Jatavallabhula

TL;DR

This work presents a novel interaction-guided and class-agnostic method based on object displacements: a user moves around a scene with an RGB-D camera and holds up objects, and the method outputs one 3D model per held-up object.

Abstract

Reconstructing compositional 3D representations of scenes, where each object is represented with its own 3D model, is a highly desirable capability in robotics and augmented reality. However, most existing methods either rely heavily on strong appearance priors for object discovery, and therefore work only on the object classes on which they were trained, or do not allow for object manipulation, which is necessary to scan objects fully and to guide object discovery in challenging scenarios. We address these limitations with a novel interaction-guided and class-agnostic method based on object displacements: a user moves around a scene with an RGB-D camera and holds up objects, and the method outputs one 3D model per held-up object. Our main contribution to this end is a novel approach to detecting user-object interactions and extracting the masks of manipulated objects. On a custom-captured dataset, our pipeline discovers manipulated objects with 78.3% precision at 100% recall and reconstructs them with a mean chamfer distance of 0.90 cm. Compared to Co-Fusion, the only comparable interaction-based and class-agnostic baseline, this corresponds to a 73% reduction in chamfer distance with 99% fewer false positives.
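For readers unfamiliar with the reconstruction metric quoted above, the following is a minimal sketch of a chamfer distance between two point sets. The abstract does not state which variant is used, so the symmetric, unsquared mean form below is an assumption, and the names `chamfer_distance` and `_mean_nearest_distance` are hypothetical. The brute-force nearest-neighbour search is only meant for small illustrative inputs.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float, float]


def _mean_nearest_distance(src: Sequence[Point], dst: Sequence[Point]) -> float:
    """Average, over points in src, of the Euclidean distance to the nearest point in dst."""
    total = 0.0
    for p in src:
        total += min(math.dist(p, q) for q in dst)
    return total / len(src)


def chamfer_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric chamfer distance (assumed variant): mean of the two directed averages."""
    if not a or not b:
        raise ValueError("both point sets must be non-empty")
    return 0.5 * (_mean_nearest_distance(a, b) + _mean_nearest_distance(b, a))


# Usage: two points, and the same two points shifted by 0.01 along z.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
shifted = [(0.0, 0.0, 0.01), (1.0, 0.0, 0.01)]
print(chamfer_distance(reference, shifted))
```

If the coordinates are in metres, the result is in metres as well, so a value such as 0.90 cm would correspond to 0.009 in these units.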

Paper Structure

This paper contains 16 sections, 2 equations, 8 figures, 1 table.

Figures (8)

  • Figure 1: We present PickScan, an interaction-guided and class-agnostic pipeline for compositional scene reconstruction. Our method lets a user pick up and move around objects, and outputs the object masks, 3D model and per-frame poses of each manipulated object.
  • Figure 2: Example illustrating the benefit of user-object interactions for object discovery. Toy bricks of different sizes are placed next to each other in a 4 by 3 grid, out of which a 4 by 1 brick is picked up. Left: The mask found by a state-of-the-art segmentation network (Kirillov et al., 2023) in the static scene that is closest, in terms of intersection over union, to the actual 4 by 1 brick's mask. Right: Object mask discovered by leveraging the handheld interaction, using our method.
  • Figure 3: Overview of our pipeline and breakdown of its two phases: candidate object mask detection (green) and the object interaction phase (purple). Our contributions are highlighted in yellow while existing components are shown with white filling.
  • Figure 4: Interaction detection visualized: A user-object interaction is detected as a period in which the distance between the hand point cloud and the initial point cloud (blue) rises above the distance between the hand point cloud and the candidate object point cloud (orange), before dropping back below it. In this example, detected interactions are highlighted in alternating shades of grey.
  • Figure 5: Scenes 1 to 3, from left to right, of our dataset. The images show the scenes in their initial state before any manipulation has taken place.
  • ...and 3 more figures
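
The crossing rule described in the Figure 4 caption can be sketched as a scan over two per-frame distance series. This is an illustrative sketch only, not the authors' implementation: the function name `detect_interactions`, the strict `>` comparison, the absence of any smoothing or thresholding, and the choice to discard a period that has not yet dropped back below by the last frame are all assumptions. The two input series are assumed to be precomputed per frame.

```python
from typing import List, Optional, Sequence, Tuple


def detect_interactions(
    dist_to_initial: Sequence[float],
    dist_to_object: Sequence[float],
) -> List[Tuple[int, int]]:
    """Return inclusive (start, end) frame ranges in which the hand-to-initial
    distance stays above the hand-to-object distance.

    A range is reported only if both the upward and the downward crossing
    are observed (assumption based on the caption's wording).
    """
    if len(dist_to_initial) != len(dist_to_object):
        raise ValueError("both series must have one value per frame")

    intervals: List[Tuple[int, int]] = []
    start: Optional[int] = None
    prev_above: Optional[bool] = None

    for t, (d_init, d_obj) in enumerate(zip(dist_to_initial, dist_to_object)):
        above = d_init > d_obj
        if prev_above is False and above:
            start = t  # upward crossing: a candidate interaction begins
        elif prev_above and not above and start is not None:
            intervals.append((start, t - 1))  # downward crossing closes it
            start = None
        prev_above = above

    return intervals


# Usage with made-up distances (metres); not data from the paper.
hand_to_initial = [0.1, 0.1, 0.5, 0.6, 0.1, 0.1, 0.7, 0.1]
hand_to_object = [0.3] * 8
print(detect_interactions(hand_to_initial, hand_to_object))
```

In practice, noisy distance estimates would likely call for smoothing or a minimum duration before a crossing counts, which this sketch leaves out.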