QueSTMaps: Queryable Semantic Topological Maps for 3D Scene Understanding
Yash Mehan, Kumaraditya Gupta, Rohit Jayanti, Anirudh Govil, Sourav Garg, Madhava Krishna
TL;DR
This work addresses the need for joint topological and semantic understanding of 3D indoor scenes to support natural-language querying and navigation. It introduces a two-step pipeline: (1) a novel multi-channel occupancy representation from which Mask R-CNN extracts the topology (rooms and transitions), and (2) a CLIP-aligned, self-attention-based room labeling system built on a 3D object instance map. Key contributions include the multi-channel occupancy representation, CLIP-based room embeddings derived from object content, and an end-to-end pipeline extended to Matterport3D with transition-region annotations. The method outperforms the prior state of the art by ~20% on room segmentation and ~12% on room classification. The approach enables open-vocabulary, room-level queries (e.g., “place to cook” for locating kitchens), with practical applications in robot navigation and planning in complex indoor environments. Future work may address scene-zone segmentation and integration with downstream planning systems.
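To make step (2) more concrete, the following is a minimal sketch of how per-object features could be combined into a single room embedding with one self-attention layer followed by mean pooling. It is an illustration under stated assumptions, not the authors' architecture: the function `room_embedding`, the feature dimension, the single attention head, the mean pooling, and the random (untrained) weights are all hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def room_embedding(object_feats, w_q, w_k, w_v):
    """Aggregate (n_objects, d) features into one unit-norm room vector.

    Hypothetical single-head self-attention plus mean pooling; the real
    model's depth, heads, and pooling are not specified in this summary.
    """
    q = object_feats @ w_q
    k = object_feats @ w_k
    v = object_feats @ w_v
    # Each object attends to every object in the same room.
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]), axis=-1)  # (n, n)
    attended = attn @ v                                       # (n, d)
    pooled = attended.mean(axis=0)                            # (d,)
    return pooled / np.linalg.norm(pooled)

# Toy stand-ins: 5 objects with 8-D features (real CLIP features are larger).
rng = np.random.default_rng(0)
d = 8
object_feats = rng.normal(size=(5, d))
# Random projections stand in for learned weights (assumption).
w_q, w_k, w_v = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

room_vec = room_embedding(object_feats, w_q, w_k, w_v)
```

In a trained system the projection weights would be learned so that the pooled vector lands near the CLIP text embedding of the room's label; here they are random and serve only to show the data flow.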
Abstract
Robotic tasks such as planning and navigation require a hierarchical semantic understanding of a scene, which could include multiple floors and rooms. Current methods for 3D scene understanding focus primarily on object segmentation. However, such methods struggle to segment out topological regions such as a "kitchen". In this work, we introduce a two-step pipeline to solve this problem. First, we extract a topological map, i.e., a floorplan of the indoor scene, using a novel multi-channel occupancy representation. Then, using a self-attention transformer, we generate CLIP-aligned features and semantic labels for every room instance based on the objects it contains. Our language-topology alignment supports natural language querying, e.g., the query "place to cook" locates the "kitchen". We outperform the current state-of-the-art on room segmentation by ~20% and room classification by ~12%. Our detailed qualitative analysis and ablation studies provide insights into the problem of joint structural and semantic 3D scene understanding. Project Page: quest-maps.github.io
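The natural-language querying described in the abstract can be illustrated with a small retrieval sketch: given one embedding per room and an embedding of the text query in the same space, rooms are ranked by cosine similarity. This is a hedged illustration, not the authors' code. The helper names (`cosine_similarity`, `rank_rooms`), the room identifiers, and the 4-D toy vectors are assumptions; in practice the vectors would come from a CLIP text encoder and the room-labeling model.

```python
import numpy as np

def cosine_similarity(query, rooms):
    # Normalize both sides so the dot product equals cosine similarity.
    query = query / np.linalg.norm(query)
    rooms = rooms / np.linalg.norm(rooms, axis=1, keepdims=True)
    return rooms @ query

def rank_rooms(query_embedding, room_embeddings, room_ids):
    """Return (room_id, score) pairs sorted from best to worst match."""
    scores = cosine_similarity(query_embedding, room_embeddings)
    order = np.argsort(-scores)
    return [(room_ids[i], float(scores[i])) for i in order]

# Hypothetical room embeddings (toy values, not real CLIP features).
room_ids = ["kitchen", "bedroom", "bathroom"]
room_embeddings = np.array([
    [0.9, 0.1, 0.0, 0.1],
    [0.1, 0.8, 0.2, 0.0],
    [0.0, 0.2, 0.9, 0.1],
])

# Stand-in for the embedding of the text query "place to cook".
query_embedding = np.array([0.8, 0.2, 0.1, 0.0])

ranking = rank_rooms(query_embedding, room_embeddings, room_ids)
best_room = ranking[0][0]
```

Because the query is compared against embeddings rather than a fixed label set, any phrase the text encoder can embed may be used as a query, which is what makes the lookup open-vocabulary.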
