Open-Set 3D Semantic Instance Maps for Vision Language Navigation -- O3D-SIM

Laksh Nanwani, Kumaraditya Gupta, Aditya Mathur, Swayam Agrawal, A. H. Abdul Hafez, K. Madhava Krishna

TL;DR

O3D-SIM introduces an open-set 3D semantic instance map for vision-language navigation, extending prior 2D closed-set representations to a 3D space with per-instance embeddings derived from open-set models. The pipeline combines RAM, Grounding DINO, SAM, CLIP, and DINOv2 to produce per-object masks and embeddings, which are back-projected into 3D and incrementally clustered into a unified map. This open-set 3D representation enhances grounding of natural language commands, including unseen object categories, and improves navigation success rates in both Matterport3D/Habitat and real-world experiments. The work demonstrates that integrating foundational vision-language models with a 3D semantic map can approach human-like understanding for complex, instance-specific queries in realistic environments.

Abstract

Humans excel at forming mental maps of their surroundings, equipping them to understand object relationships and navigate based on language queries. Our previous work, SI Maps (Nanwani L, Agarwal A, Jain K, et al. Instance-level semantic maps for vision language navigation. In: 2023 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN). IEEE; 2023 Aug.), showed that having instance-level information and the semantic understanding of an environment helps significantly improve performance for language-guided tasks. We extend this instance-level approach to 3D while increasing the pipeline's robustness and improving quantitative and qualitative results. Our method leverages foundational models for object recognition, image segmentation, and feature extraction. We propose a representation that results in a 3D point cloud map with instance-level embeddings, which bring in the semantic understanding that natural language commands can query. Quantitatively, the work improves upon the success rate of language-guided tasks. At the same time, we qualitatively observe the ability to identify instances more clearly and leverage the foundational models and language and image-aligned embeddings to identify objects that, otherwise, a closed-set approach wouldn't be able to identify. Project Page - https://smart-wheelchair-rrc.github.io/o3d-sim-webpage

Paper Structure

This paper contains 19 sections, 2 equations, 7 figures, 1 table.

Figures (7)

  • Figure 1: We carry out complex instance-specific goal navigation in object-rich environments. The language queries refer to individual instances by their spatial and viewpoint configuration relative to other objects of the same type, and navigation performance on standard language queries is preserved.
  • Figure 2: An overview of the proposed 3D mapping pipeline. Labels generated by the RAM model are input into Grounding DINO to generate bounding boxes for the detected labels. Subsequently, instance masks are created using the SAM model, while CLIP and DINOv2 embeddings are extracted in parallel. These masks, along with the semantic embeddings, are back-projected into 3D space to identify 3D instances. These instances are then refined using a density-based clustering algorithm to produce the O3D-SIM. Figure 3 shows how the object instance clustering works as part of O3D-SIM.
  • Figure 3: Clustering in 3D of object instances using semantic embeddings and volumetric overlap. Semantic similarity is verified using CLIP and DINOv2 embeddings. Volumetric overlap is calculated using 3D bounding boxes and overlap matrices. The figure shows the clustering process between two camera poses, each represented by a coloured frustum, with the corresponding object point clouds shown in the same colour. For the O3D-SIM pipeline, this process is repeated for each camera pose, and the objects from each frame are either merged into an existing instance or added as a new instance in the incrementally built representation. The first example shows a positive case of merging, where the same chair instance is compared from different poses. The chairs are merged into a single instance because both the semantic similarity and the volumetric overlap checks succeed. The second example shows a case where the table (red) and the skateboard (blue), being very close to each other, have a volumetric overlap but are not merged because the semantic similarity check fails, creating two different object instances. The third example shows two separate chairs near each other. Since they are two different chairs, they fail the volumetric overlap check, creating two separate instances.
  • Figure 4: The difference in ChatGPT output caused by the nature of the two mapping approaches: SI-Maps is closed-set, and O3D-SIM is open-set. For queries specifying exact object classes, both approaches output the same code. However, for queries specified in an open-set manner, the newer approach passes a description of the goal to the code, whereas the older approach maps the description to one of the pre-known classes and passes this class to the code. The older approach benefits from the LLM's understanding, whereas the newer approach benefits from the open-set embeddings (CLIP).
  • Figure 5: This figure shows the process of generating a goal point once the target object has been localised based on a sample language query and O3D-SIM.
  • ...and 2 more figures
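
The merge decision described in the Figure 3 caption (merge two observations only when both the semantic similarity check and the volumetric overlap check succeed) can be illustrated with a minimal sketch. This is not the authors' implementation. The `Instance` container, the axis-aligned boxes, the overlap measure (intersection volume over the smaller box), the two threshold values, and the two-dimensional toy embeddings are all assumptions made for illustration, and the density-based refinement step mentioned in the Figure 2 caption is not modelled.

```python
import math
from dataclasses import dataclass


@dataclass
class Instance:
    """Hypothetical container for one observed 3D object instance."""
    clip: list      # CLIP embedding (toy 2-D vector here, an assumption)
    dino: list      # DINOv2 embedding (toy 2-D vector here, an assumption)
    box_min: tuple  # min corner of an axis-aligned 3D box (assumed box type)
    box_max: tuple  # max corner of an axis-aligned 3D box


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0


def box_volume(lo, hi):
    """Volume of an axis-aligned box; zero if any extent is non-positive."""
    vol = 1.0
    for l, h in zip(lo, hi):
        vol *= max(0.0, h - l)
    return vol


def overlap_ratio(a, b):
    """Intersection volume over the smaller box's volume (assumed measure)."""
    lo = tuple(max(x, y) for x, y in zip(a.box_min, b.box_min))
    hi = tuple(min(x, y) for x, y in zip(a.box_max, b.box_max))
    smaller = min(box_volume(a.box_min, a.box_max),
                  box_volume(b.box_min, b.box_max))
    return box_volume(lo, hi) / smaller if smaller > 0 else 0.0


def should_merge(a, b, sim_thresh=0.8, overlap_thresh=0.3):
    """Merge only if BOTH checks pass. Threshold values are assumptions."""
    semantic_ok = (cosine(a.clip, b.clip) >= sim_thresh
                   and cosine(a.dino, b.dino) >= sim_thresh)
    spatial_ok = overlap_ratio(a, b) >= overlap_thresh
    return semantic_ok and spatial_ok


# Toy data mirroring the three cases in the Figure 3 caption.
chair_a = Instance([1.0, 0.1], [0.9, 0.2], (0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
chair_b = Instance([0.9, 0.2], [1.0, 0.1], (0.2, 0.0, 0.0), (1.2, 1.0, 1.0))
chair_c = Instance([1.0, 0.1], [0.9, 0.2], (1.5, 0.0, 0.0), (2.5, 1.0, 1.0))
table = Instance([1.0, 0.1], [0.9, 0.2], (0.0, 0.0, 0.0), (1.0, 1.0, 1.0))
skateboard = Instance([0.1, 1.0], [0.2, 0.9], (0.1, 0.1, 0.0), (0.9, 0.4, 0.2))

same_chair_two_poses = should_merge(chair_a, chair_b)    # both checks pass
table_vs_skateboard = should_merge(table, skateboard)    # semantic check fails
two_nearby_chairs = should_merge(chair_a, chair_c)       # overlap check fails
```

In an incremental map, a check of this kind would be run between each newly observed object and the instances already in the map, with the object either merged into a match or appended as a new instance, as the caption describes. Requiring both conditions is what keeps adjacent but different objects apart and keeps similar-looking but spatially separate objects apart.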