Open-Set 3D Semantic Instance Maps for Vision Language Navigation -- O3D-SIM
Laksh Nanwani, Kumaraditya Gupta, Aditya Mathur, Swayam Agrawal, A. H. Abdul Hafez, K. Madhava Krishna
TL;DR
O3D-SIM introduces an open-set 3D semantic instance map for vision-language navigation, extending prior 2D closed-set representations to 3D with per-instance embeddings derived from open-set models. The pipeline combines RAM (Recognize Anything Model), Grounding DINO, SAM (Segment Anything Model), CLIP, and DINOv2 to produce per-object masks and embeddings, which are back-projected into 3D and incrementally clustered into a unified map. This open-set 3D representation improves the grounding of natural language commands, including those that refer to unseen object categories, and raises navigation success rates in both Matterport3D/Habitat and real-world experiments. The work shows that combining foundation vision-language models with a 3D semantic map supports complex, instance-specific queries in realistic environments.
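The back-projection and incremental clustering steps mentioned above can be illustrated with a minimal, dependency-free sketch. It is not the authors' implementation: the pinhole camera model, the function names (`back_project`, `integrate`), the merge rule (cosine similarity of embeddings plus centroid distance), and both thresholds are assumptions chosen only to make the idea concrete.

```python
import math

def back_project(pixels, depth, fx, fy, cx, cy):
    """Lift masked pixels (u, v) to camera-frame 3D points (pinhole model, assumed)."""
    points = []
    for u, v in pixels:
        z = depth[v][u]
        if z <= 0:  # skip pixels with missing or invalid depth
            continue
        points.append(((u - cx) * z / fx, (v - cy) * z / fy, z))
    return points

def centroid(points):
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(3))

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def integrate(instances, points, embedding, sim_thresh=0.8, dist_thresh=0.5):
    """Merge one detection into a matching instance, or start a new instance.

    Hypothetical rule: merge when the embeddings are similar AND the
    centroids are close; both thresholds are illustrative values.
    """
    if not points:
        return instances
    c = centroid(points)
    for inst in instances:
        similar = cosine(inst["embedding"], embedding) >= sim_thresh
        close = math.dist(centroid(inst["points"]), c) <= dist_thresh
        if similar and close:
            n = inst["count"]
            # Running average keeps one embedding per instance.
            inst["embedding"] = [(e * n + f) / (n + 1)
                                 for e, f in zip(inst["embedding"], embedding)]
            inst["points"].extend(points)
            inst["count"] = n + 1
            return instances
    instances.append({"points": list(points),
                      "embedding": list(embedding),
                      "count": 1})
    return instances
```

In this sketch, two detections of the same object seen from nearby viewpoints collapse into one instance, while a visually similar object elsewhere in the scene stays a separate instance because its centroid is far away.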
Abstract
Humans excel at forming mental maps of their surroundings, equipping them to understand object relationships and navigate based on language queries. Our previous work, SI Maps (Nanwani L, Agarwal A, Jain K, et al. Instance-level semantic maps for vision language navigation. In: 2023 32nd IEEE International Conference on Robot and Human Interactive Communication (RO-MAN). IEEE; 2023 Aug.), showed that combining instance-level information with a semantic understanding of an environment significantly improves performance on language-guided tasks. We extend this instance-level approach to 3D while increasing the pipeline's robustness and improving quantitative and qualitative results. Our method leverages foundational models for object recognition, image segmentation, and feature extraction. We propose a representation that results in a 3D point cloud map with instance-level embeddings, which provide the semantic understanding that natural language commands can query. Quantitatively, the work improves the success rate of language-guided tasks. Qualitatively, we observe that instances are identified more clearly, and that the foundational models and the language- and image-aligned embeddings make it possible to identify objects that a closed-set approach could not. Project page: https://smart-wheelchair-rrc.github.io/o3d-sim-webpage
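To make the idea of querying instance-level embeddings with a language command concrete, here is a minimal sketch that ranks map instances by cosine similarity to a query embedding. It is an illustration under stated assumptions, not the paper's retrieval code: `rank_instances`, the `id`/`embedding` record layout, and the toy 3-dimensional vectors are hypothetical, and in practice the query vector would come from a text encoder aligned with the image embeddings (such as CLIP's).

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0

def rank_instances(query_embedding, instances, top_k=3):
    """Return up to top_k (score, instance_id) pairs, best match first."""
    scored = [(cosine(query_embedding, inst["embedding"]), inst["id"])
              for inst in instances]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Toy map with hand-made embeddings (hypothetical values).
toy_map = [
    {"id": "chair_0", "embedding": [0.9, 0.1, 0.0]},
    {"id": "chair_1", "embedding": [0.8, 0.2, 0.1]},
    {"id": "plant_0", "embedding": [0.0, 0.1, 0.9]},
]
# Stand-in for the embedding of a text query about a plant.
query = [0.1, 0.0, 1.0]
best = rank_instances(query, toy_map, top_k=1)
```

Because each instance carries its own embedding, a query can be matched against individual objects rather than against a fixed list of class labels, which is what allows categories outside a closed set to be retrieved.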
