QueSTMaps: Queryable Semantic Topological Maps for 3D Scene Understanding

Yash Mehan, Kumaraditya Gupta, Rohit Jayanti, Anirudh Govil, Sourav Garg, Madhava Krishna

TL;DR

This work addresses the need for joint topological and semantic understanding of 3D indoor scenes to support natural-language querying and navigation. It introduces a two-step pipeline: (1) a novel multi-channel occupancy representation from which Mask R-CNN extracts the scene topology (rooms and transition regions), and (2) a CLIP-aligned, self-attention-based room labeling system built on a 3D object instance map. Key contributions include the multi-channel occupancy representation, CLIP-based room embeddings derived from object content, and an end-to-end pipeline extended to Matterport3D with transition-region annotations. The method outperforms the prior state of the art by about 20% on room segmentation and about 12% on room classification. It enables open-vocabulary, room-level queries (e.g., “place to cook” for locating kitchens), which is directly useful for robot navigation and planning in complex indoor environments. Future work may address scene-zone segmentation and integration with downstream planning systems.

Abstract

Robotic tasks such as planning and navigation require a hierarchical semantic understanding of a scene, which could include multiple floors and rooms. Current methods primarily focus on object segmentation for 3D scene understanding. However, such methods struggle to segment out topological regions like "kitchen" in the scene. In this work, we introduce a two-step pipeline to solve this problem. First, we extract a topological map, i.e., a floorplan of the indoor scene, using a novel multi-channel occupancy representation. Then, using a self-attention transformer, we generate CLIP-aligned features and semantic labels for every room instance based on the objects it contains. Our language-topology alignment supports natural language querying, e.g., a "place to cook" locates the "kitchen". We outperform the current state-of-the-art on room segmentation by ~20% and room classification by ~12%. Our detailed qualitative analysis and ablation studies provide insights into the problem of joint structural and semantic 3D scene understanding. Project Page: quest-maps.github.io
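
To make the querying idea concrete, here is a minimal sketch of how a room-level natural language query could be resolved once every room has a CLIP-aligned embedding. It is not the authors' implementation: the function name `rank_rooms`, the toy 3-D vectors, and the choice of cosine similarity as the scoring function are assumptions made for illustration. In practice the query vector would come from a CLIP text encoder and the room vectors from the room-labeling transformer.

```python
import numpy as np


def rank_rooms(query_emb, room_embs, room_names):
    """Rank rooms by cosine similarity to a query embedding (hypothetical helper)."""
    # Normalise so that the dot product equals cosine similarity.
    q = query_emb / np.linalg.norm(query_emb)
    r = room_embs / np.linalg.norm(room_embs, axis=1, keepdims=True)
    sims = r @ q
    order = np.argsort(-sims)  # indices sorted from most to least similar
    return [(room_names[i], float(sims[i])) for i in order]


# Toy 3-D vectors standing in for real CLIP embeddings (assumption).
room_names = ["kitchen", "bedroom", "bathroom"]
room_embs = np.array([
    [0.9, 0.1, 0.0],
    [0.1, 0.9, 0.1],
    [0.0, 0.2, 0.9],
])
# Stand-in for the text embedding of the query "place to cook".
query_emb = np.array([0.8, 0.2, 0.1])

ranking = rank_rooms(query_emb, room_embs, room_names)
print(ranking[0][0])  # best-matching room for the query
```

With these toy vectors the query lands closest to the "kitchen" embedding; the same ranking step would apply unchanged to real, higher-dimensional embeddings.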

Paper Structure (20 sections, 4 equations, 4 figures, 3 tables)

Figures (4)

  • Figure 1: Method Overview. From a 3D scene point cloud reconstructed from a posed RGB-D sequence, our method builds an indoor scene topology map, extracts room and transition-region segments, and generates room-label-aligned CLIP embeddings and room labels. We propose a novel multi-channel input representation and use an instance segmentation network to predict room masks and detect transition regions. Object-level point clouds and CLIP embeddings are generated using an instance mapping pipeline similar to ConceptGraphs and are associated with their rooms from the previous step. A Transformer network takes these object-level CLIP embeddings and generates a room label and an aligned CLIP embedding, useful for tasks such as language-guided robot navigation with room-level understanding.
  • Figure 2: Qualitative comparisons on the room segmentation task. Our proposed method handles adverse cases such as regions of low density (introduced by the scanning strategy) in the Structured3D dataset, as well as curved walls and clutter as seen in the Matterport3D dataset. Colors only indicate instance separation.
  • Figure 3: Qualitative results on Matterport3D. Left: Original 3D scene point cloud, our multi-channel input representation, and the predicted room segmentation masks for reference. Right: Qualitative comparisons for various room label queries. Here, sections of the point cloud are colored by their similarity to the corresponding query embedding. The color progression from blue to green to yellow signifies low to medium to high similarity scores.
  • Figure : Topology Extraction and Room Labeling for Indoor Scene 3D Point Clouds. Given a 3D scene point cloud, we (a) predict room and transition regions, (b) generate room-label-aligned CLIP embeddings to assign room labels, and (c) build a topological map that supports room-level natural language queries, e.g., the query "place to cook" locates the "kitchen".
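
The Figure 1 caption states that object-level point clouds are associated with the rooms predicted in the segmentation step. One simple way to picture that association is sketched below. It assumes the room masks are available as a top-down integer grid and that each object is assigned by looking up the grid cell under its centroid. The function name `assign_objects_to_rooms`, the grid layout, and the centroid rule are illustrative assumptions, not the procedure confirmed by the paper.

```python
import numpy as np


def assign_objects_to_rooms(centroids_xyz, room_mask, origin_xy, cell_size):
    """Look up the room id under each object centroid (hypothetical helper).

    room_mask: 2D int array in top-down view; 0 = no room, k > 0 = room instance k.
    origin_xy: world (x, y) coordinates of the corner of cell [0, 0].
    cell_size: edge length of one grid cell, in the same units as the centroids.
    Returns one room id per object, 0 if the centroid falls outside any room.
    """
    height, width = room_mask.shape
    room_ids = []
    for x, y, _z in centroids_xyz:
        # Project the centroid onto the floor plane and convert to grid indices.
        col = int(np.floor((x - origin_xy[0]) / cell_size))
        row = int(np.floor((y - origin_xy[1]) / cell_size))
        if 0 <= row < height and 0 <= col < width:
            room_ids.append(int(room_mask[row, col]))
        else:
            room_ids.append(0)
    return room_ids


# Toy 2 x 4 room mask with two room instances (assumption).
room_mask = np.array([
    [1, 1, 2, 2],
    [1, 1, 2, 2],
])
centroids = np.array([
    [0.5, 0.5, 0.8],  # falls in room 1
    [3.2, 1.4, 0.4],  # falls in room 2
    [9.0, 9.0, 1.0],  # outside the mapped area
])
object_rooms = assign_objects_to_rooms(
    centroids, room_mask, origin_xy=(0.0, 0.0), cell_size=1.0
)
print(object_rooms)
```

Grouping objects this way yields, for each room, the set of object embeddings that a room-labeling network could then consume. Objects that straddle a room boundary would need a more careful rule than a single centroid lookup.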