Spot-Compose: A Framework for Open-Vocabulary Object Retrieval and Drawer Manipulation in Point Clouds

Oliver Lemke, Zuria Bauer, René Zurbrügg, Marc Pollefeys, Francis Engelmann, Hermann Blum

TL;DR

The paper presents Spot-Compose, a modular framework that combines open-vocabulary 3D instance segmentation (OpenMask3D), grasp pose estimation (AnyGrasp), and adaptive navigation to enable dynamic object retrieval and drawer manipulation in human-centric environments. It demonstrates a real-world pipeline on the Spot robot that localizes arbitrary objects via natural language queries, computes grasp poses together with suitable robot positions, and estimates drawer motion axes to access concealed spaces. The main contributions are a Spot-based integration platform, end-to-end open-vocabulary object interaction in 3D scenes, and empirical results showing a 51% success rate for grasping and 82% for drawer search across varied scenes and objects. The work highlights the practical potential of combining 3D perception, manipulation, and motion planning, using commodity scanners and mobile robots, to operate in everyday human environments.

Abstract

In recent years, modern techniques in deep learning and large-scale datasets have led to impressive progress in 3D instance segmentation, grasp pose estimation, and robotics. This allows for accurate detection directly in 3D scenes, object- and environment-aware grasp prediction, as well as robust and repeatable robotic manipulation. This work aims to integrate these recent methods into a comprehensive framework for robotic interaction and manipulation in human-centric environments. Specifically, we leverage 3D reconstructions from a commodity 3D scanner for open-vocabulary instance segmentation, alongside grasp pose estimation, to demonstrate dynamic picking of objects, and opening of drawers. We show the performance and robustness of our model in two sets of real-world experiments including dynamic object retrieval and drawer opening, reporting a 51% and 82% success rate respectively. Code of our framework as well as videos are available on: https://spot-compose.github.io/.

Paper Structure (10 sections, 5 equations, 3 figures, 1 table)

Figures (3)

  • Figure 1: Overview of the Spot-Compose pipeline. Given a previously acquired point cloud, we segment the scene and localize the wanted object via a natural language query. For object retrieval (top), we isolate the object to determine the most effective grasp. For drawer manipulation (bottom) we use the cabinet position to point our camera for 2D drawer detection.
  • Figure 2: Adaptive grasping and drawer interaction pipeline. On the left, we illustrate the grasping sequence, which begins with the successful localization of the watering can through 3D instance segmentation. Following this, the navigation planner computes an optimal robot position and the object is grasped. The right side of the figure details the drawer detection and manipulation process. Multiple images are captured for robust detection. Subsequently, the robot is maneuvered into position to facilitate drawer opening. This dual-phase approach demonstrates the integration of object detection, navigation planning, and execution within a dynamic scene. On the respective sides we illustrate example objects and handles of varying difficulty.
  • Figure 3: Grasping experiment results. To evaluate the grasping capability of our framework, we conduct 59 trial runs across six different scenes and with 13 distinct objects. The test includes items and placements of varying difficulty. We observed an overall success rate of 51%, with the highest failure rates occurring in detection and manipulation. Search experiment results. For this evaluation, we conduct 16 runs with six individual drawers. Moreover, we explore combinations of 15 handles and 19 objects, experimenting with various pairings. We observe an 82% success rate, with the majority of failure cases connected to poor perception, especially an inaccurate Time-of-Flight depth sensor.
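
The natural-language localization step described in the Figure 1 caption can be illustrated with a minimal sketch. It assumes that each segmented 3D instance carries a feature vector in the same embedding space as the text query, as is typical of CLIP-style open-vocabulary methods, and that the best match is chosen by cosine similarity. The function name `select_instance` and the toy vectors are hypothetical and serve only as illustration; this is not the authors' implementation.

```python
import numpy as np


def select_instance(instance_features: np.ndarray,
                    query_embedding: np.ndarray) -> tuple[int, np.ndarray]:
    """Pick the instance whose feature is most similar to the query.

    instance_features: (num_instances, dim) array, one row per 3D instance.
    query_embedding:   (dim,) array for the natural-language query.
    Returns the index of the best match and all cosine similarity scores.
    """
    # Normalize to unit length so the dot product equals cosine similarity.
    feats = instance_features / np.linalg.norm(
        instance_features, axis=1, keepdims=True)
    query = query_embedding / np.linalg.norm(query_embedding)
    scores = feats @ query
    return int(np.argmax(scores)), scores


# Toy data (assumed): three instances with 3-dimensional features.
features = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.6, 0.8, 0.0],
])
query = np.array([0.0, 2.0, 0.0])

best, scores = select_instance(features, query)
print(best, scores)
```

In this toy case the second instance (index 1) is selected because its feature points in the same direction as the query. In a real system the features would come from the segmentation model and the query embedding from a text encoder, and a similarity threshold may be needed to reject queries that match no instance well.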