Beyond Bare Queries: Open-Vocabulary Object Grounding with 3D Scene Graph

Sergey Linok, Tatiana Zemskova, Svetlana Ladanova, Roman Titkov, Dmitry Yudin, Maxim Monastyrny, Aleksei Valenkov

TL;DR

The paper tackles open-vocabulary 3D object grounding in cluttered indoor environments by building an object-centric 3D map and a compact 3D scene graph with metric and semantic edges. A two-stage deductive reasoning process integrates scene graph descriptions with a large language model to ground objects described by relational natural language queries, without fine-tuning on domain data. Empirical results on Replica, ScanNet, Sr3D+, Nr3D, and ScanRefer show strong performance, with ablations highlighting the value of semantic edges and the two-stage reasoning approach. The work also demonstrates practical efficiency on onboard hardware and provides publicly available code for robotics applications.

Abstract

Locating objects described in natural language presents a significant challenge for autonomous agents. Existing CLIP-based open-vocabulary methods successfully perform 3D object grounding with simple (bare) queries, but cannot cope with ambiguous descriptions that demand an understanding of object relations. To tackle this problem, we propose a modular approach called BBQ (Beyond Bare Queries), which constructs a 3D scene graph representation with metric and semantic spatial edges and utilizes a large language model as a human-to-agent interface through our deductive scene reasoning algorithm. BBQ employs robust DINO-powered associations to construct a 3D object-centric map and an advanced raycasting algorithm with a 2D vision-language model to describe the mapped objects as graph nodes. On the Replica and ScanNet datasets, we demonstrate that BBQ takes a leading place in open-vocabulary 3D semantic segmentation among zero-shot methods. We also show that leveraging spatial relations is especially effective for scenes containing multiple entities of the same semantic class. On the challenging Sr3D+, Nr3D, and ScanRefer benchmarks, our deductive approach demonstrates a significant improvement over other state-of-the-art methods, enabling object grounding from complex queries. The combination of our design choices and software implementation has resulted in high data processing speed in experiments on a robot's on-board computer. This promising performance enables the application of our approach in intelligent robotics projects. We made the code publicly available at https://linukc.github.io/BeyondBareQueries/.
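
To make the idea of grounding a relational query over a scene graph with metric edges more concrete, here is a minimal Python sketch. It is not the authors' implementation: the `Node` class, the keyword matching that stands in for the language model, and the "closest to an anchor" rule are all assumptions chosen for illustration. Stage one narrows the graph to nodes relevant to the query; stage two uses metric edges between only those nodes to pick the target.

```python
import math
from dataclasses import dataclass
from itertools import combinations


@dataclass
class Node:
    """Hypothetical graph node: an object with a text description and a 3D centroid."""
    node_id: int
    description: str
    centroid: tuple  # (x, y, z) in metres


def metric_edges(nodes):
    """Metric edge = Euclidean distance between centroids, for every unordered pair."""
    return {
        frozenset({a.node_id, b.node_id}): math.dist(a.centroid, b.centroid)
        for a, b in combinations(nodes, 2)
    }


def select_relevant(nodes, keywords):
    """Stage 1 stand-in (assumption): keyword match instead of an LLM call."""
    return [n for n in nodes if any(k in n.description for k in keywords)]


def nearest_target(nodes, target_kw, anchor_kw):
    """Stage 2 stand-in (assumption): pick the target candidate closest to any anchor."""
    relevant = select_relevant(nodes, [target_kw, anchor_kw])
    edges = metric_edges(relevant)  # edges only among the reduced node set
    targets = [n for n in relevant if target_kw in n.description]
    anchors = [n for n in relevant if anchor_kw in n.description]
    best_id, best_dist = None, math.inf
    for t in targets:
        for a in anchors:
            if a.node_id == t.node_id:
                continue  # a node cannot serve as its own anchor
            d = edges[frozenset({t.node_id, a.node_id})]
            if d < best_dist:
                best_id, best_dist = t.node_id, d
    return best_id  # None when no target/anchor pair exists


scene = [
    Node(0, "white pillow", (0.0, 0.0, 0.0)),
    Node(1, "blue pillow", (3.0, 0.0, 0.0)),
    Node(2, "wooden table", (3.5, 0.0, 0.0)),
    Node(3, "floor lamp", (10.0, 0.0, 0.0)),
]
print(nearest_target(scene, "pillow", "table"))
```

A bare query such as "pillow" is ambiguous here because two pillows exist; the relational query "the pillow near the table" is resolved only once the distances to the anchor object are taken into account. Restricting the edges to the relevant nodes keeps the description passed to a language model compact.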

Paper Structure

This paper contains 17 sections, 1 equation, 6 figures, 5 tables, and 1 algorithm.

Figures (6)

  • Figure 1: The proposed BBQ approach leverages foundation models for high-performance construction of an object-centric, class-agnostic 3D map of a static indoor environment from a sequence of RGB-D frames with known camera poses and calibration. To perform scene understanding, we represent the environment as a set of nodes with spatial relations. Using the designed deductive scene reasoning algorithm, our method enables efficient natural language interaction with a scene-aware large language model.
  • Figure 2: An object-centric, class-agnostic 3D map is iteratively constructed from a sequence of RGB-D camera frames and their poses by associating 2D MobileSAMv2 mask proposals with 3D objects using deep DINOv2 visual features and spatial constraints (Sec. \ref{objects_map}). To visually represent objects after building the map, we select the best view based on the largest projected mask from $L$ cluster centroids that represent areas of object observations (Sec. \ref{3Dto2D}). We leverage LLaVA1.6 \cite{liu2023improvedllava} and the text-aligned visual encoder EVA2 \cite{fang2023eva} to describe object visual properties (Sec. \ref{nodes}). With the nodes' text descriptions, spatial locations, and metric and semantic spatial edges (Sec. \ref{edges}), we utilize an LLM in our deductive reasoning algorithm (Sec. \ref{how_to_apply}) to perform the 3D object grounding task.
  • Figure 3: Qualitative examples of 3D open-vocabulary semantic segmentation on the Replica dataset.
  • Figure 4: Qualitative examples of 3D referred object grounding on the Sr3D+/Nr3D datasets.
  • Figure 5: Husky mobile robot with SensorBox used for real-world experiments
  • ...and 1 more figure
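
The association step mentioned in the Figure 2 caption (matching each new 2D mask proposal to an existing 3D object by visual-feature similarity under a spatial constraint) can be illustrated with a short Python sketch. This is a simplified illustration, not the paper's algorithm: the dictionary layout, the running-mean update, and the thresholds `sim_thresh` and `max_dist` are hypothetical, and plain lists stand in for DINOv2 feature vectors.

```python
import math


def cosine(u, v):
    """Cosine similarity between two feature vectors given as sequences of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def associate(objects, detection, sim_thresh=0.8, max_dist=0.5):
    """Merge a detection into the best-matching object, or start a new object.

    Thresholds are illustrative assumptions, not values from the paper.
    Returns the index of the object the detection ended up in.
    """
    best_idx, best_sim = None, sim_thresh
    for idx, obj in enumerate(objects):
        # Spatial constraint: skip objects whose centroid is too far away.
        if math.dist(obj["centroid"], detection["centroid"]) > max_dist:
            continue
        sim = cosine(obj["feature"], detection["feature"])
        if sim >= best_sim:
            best_idx, best_sim = idx, sim

    if best_idx is None:
        # No object is both close enough and similar enough: create a new one.
        objects.append({
            "feature": list(detection["feature"]),
            "centroid": tuple(detection["centroid"]),
            "count": 1,
        })
        return len(objects) - 1

    # Matched: update the object's feature and centroid as running means.
    obj = objects[best_idx]
    n = obj["count"]
    obj["feature"] = [(f * n + d) / (n + 1)
                      for f, d in zip(obj["feature"], detection["feature"])]
    obj["centroid"] = tuple((c * n + d) / (n + 1)
                            for c, d in zip(obj["centroid"], detection["centroid"]))
    obj["count"] = n + 1
    return best_idx


objects = []
ids = [
    associate(objects, {"feature": [1.0, 0.0], "centroid": (0.0, 0.0, 0.0)}),
    associate(objects, {"feature": [0.9, 0.1], "centroid": (0.1, 0.0, 0.0)}),
    associate(objects, {"feature": [1.0, 0.0], "centroid": (5.0, 0.0, 0.0)}),
    associate(objects, {"feature": [0.0, 1.0], "centroid": (0.0, 0.0, 0.0)}),
]
print(ids, len(objects))
```

In this toy run the second detection merges with the first object, the third is rejected by the distance check despite identical features, and the fourth is rejected by the similarity check despite being close by. Combining both checks is what keeps nearby but visually different objects, and similar but distant ones, as separate map entries.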