
Structured Spatial Reasoning with Open Vocabulary Object Detectors

Negar Nejatishahidin, Madhukar Reddy Vongala, Jana Kosecka

TL;DR

Introduces a structured probabilistic approach that integrates rich 3D geometric features with state-of-the-art open-vocabulary object detectors to enhance spatial reasoning for robotic perception.

Abstract

Reasoning about spatial relationships between objects is essential for many real-world robotic tasks, such as fetch-and-delivery, object rearrangement, and object search. The ability to detect and disambiguate different objects and identify their locations is key to the successful completion of these tasks. Several recent works have used powerful Vision and Language Models (VLMs) to unlock this capability in robotic agents. In this paper we introduce a structured probabilistic approach that integrates rich 3D geometric features with state-of-the-art open-vocabulary object detectors to enhance spatial reasoning for robotic perception. The approach is evaluated and compared against the zero-shot performance of state-of-the-art VLMs on spatial reasoning tasks. To enable this comparison, we annotate spatial clauses in the real-world RGB-D Active Vision Dataset [1] and conduct experiments on this and the synthetic Semantic Abstraction [2] dataset. Results demonstrate the effectiveness of the proposed method, which outperforms state-of-the-art open-source VLMs by more than 20% on grounding spatial relations.

Paper Structure

This paper contains 11 sections, 2 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Examples showing inputs and desired outputs. The input consists of a triplet <target object, spatial relation, reference object> and corresponding image. The output is the target object bounding box.
  • Figure 2: Overview of our pipeline, which consists of three main modules: first, the Object Proposal Module (OPM), which provides a set of boxes as candidates for the target and reference objects; second, the Spatial Relation Module (SRM), which outputs a distribution over possible relationships for each pair; and third, the Probabilistic Ranking Module (PRM), which identifies the best triplet.
  • Figure 3: The image and depth data are masked with the object mask to compute the 3D point cloud. PCA fits a box to the point cloud, and the 6D pose and bounding box dimensions of both objects are concatenated as inputs to the MLP, which outputs a distribution over spatial relation classes.
  • Figure 4: Examples from Semantic Abstraction dataset. (a) Door on the left of the dresser. (b) Spoon on the right of the apple. (c) Basketball on the right of desk. The dataset showcases challenges arising from small object sizes, occlusions, and clutter. Each data sample has semantic segmentation, depth images, ground truth mask of objects, and the expressions.
  • Figure 5: In part (a), we generate instance segmentation using auto-labeling [Li2023LabelingIS] to obtain object pairs. In part (b), we generate object point clouds, remove outliers, and fit 3D bounding boxes. Finally, in part (c), we generate spatial reasoning expressions based on the 3D information.
  • ...and 2 more figures
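
The geometric step described in the Figure 3 caption (mask the depth image, back-project to a 3D point cloud, fit a box with PCA) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `masked_point_cloud` and `fit_pca_box`, the pinhole intrinsics `fx, fy, cx, cy`, and the choice to take box extents along the principal axes are all assumptions made for illustration.

```python
import numpy as np


def masked_point_cloud(depth, mask, fx, fy, cx, cy):
    """Back-project masked depth pixels to 3D with a pinhole camera model.

    depth: (H, W) array of metric depth; mask: (H, W) boolean object mask.
    fx, fy, cx, cy are assumed camera intrinsics (hypothetical parameters).
    """
    depth = np.asarray(depth, dtype=float)
    mask = np.asarray(mask, dtype=bool)
    # Keep only pixels that are inside the mask and have valid depth.
    v, u = np.nonzero(mask & (depth > 0))
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=1)


def fit_pca_box(points):
    """Fit an oriented 3D box to an (N, 3) point cloud using PCA.

    Returns (box_center, axes, dims): axes is a 3x3 rotation matrix whose
    columns are the principal directions, dims are the extents along them.
    """
    points = np.asarray(points, dtype=float)
    mean = points.mean(axis=0)
    centered = points - mean
    # Eigenvectors of the covariance matrix are the principal axes.
    cov = np.cov(centered, rowvar=False)
    _, eigvecs = np.linalg.eigh(cov)       # eigenvalues in ascending order
    axes = eigvecs[:, ::-1].copy()         # reorder to decreasing variance
    if np.linalg.det(axes) < 0:            # keep a right-handed frame
        axes[:, 2] *= -1
    # Express the points in the PCA frame and take min/max extents.
    local = centered @ axes
    mins, maxs = local.min(axis=0), local.max(axis=0)
    dims = maxs - mins
    box_center = mean + axes @ ((mins + maxs) / 2)
    return box_center, axes, dims
```

In the pipeline described by the caption, the pose and dimensions obtained this way for the target and reference objects would be concatenated and passed to an MLP; how the rotation is parameterized in that feature vector is not stated in the text above, so it is left out of this sketch.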