Mapping High-level Semantic Regions in Indoor Environments without Object Recognition

Roberto Bigazzi, Lorenzo Baraldi, Shreyas Kousik, Rita Cucchiara, Marco Pavone

TL;DR

This work tackles Indoor Semantic Region Mapping (ISRM) by enabling a robot to build a global map of high-level indoor regions without relying on object detection. It introduces a CLIP-based region classifier trained with a multi-modal supervised contrastive loss, integrated into an egocentric-to-global mapping framework with a hierarchical navigation policy. The authors create a large offline MP3D/Habitat dataset and demonstrate substantial gains in region labeling and mapping accuracy, including online, room-level mapping in photorealistic simulations, outperforming object-based baselines. The results indicate that grounding high-level region semantics can improve autonomous navigation, explainability, and human-robot collaboration in indoor environments.

Abstract

Robots require a semantic understanding of their surroundings to operate in an efficient and explainable way in human environments. In the literature, there has been an extensive focus on object labeling and exhaustive scene graph generation; less effort has been focused on the task of purely identifying and mapping large semantic regions. The present work proposes a method for semantic region mapping via embodied navigation in indoor environments, generating a high-level representation of the knowledge of the agent. To enable region identification, the method uses a vision-to-language model to provide scene information for mapping. By projecting egocentric scene understanding into the global frame, the proposed method generates a semantic map as a distribution over possible region labels at each location. This mapping procedure is paired with a trained navigation policy to enable autonomous map generation. The proposed method significantly outperforms a variety of baselines, including an object-based system and a pretrained scene classifier, in experiments in a photorealistic simulator.
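
The projection step described in the abstract can be illustrated with a small sketch. This is not the authors' learned mapper: it is a hand-written NumPy illustration that assumes the egocentric prediction is already a small top-down grid of per-class probabilities centred on the agent, and that the agent pose is known as `(x, y, theta)` in global grid cells. The function names `update_global_map` and `label_distribution`, the grid sizes, and the three-class example are all hypothetical.

```python
import numpy as np

def update_global_map(global_acc, ego_probs, pose):
    """Accumulate an egocentric probability grid into the global map (in place).

    global_acc: (H, W, C) float array of accumulated probability mass.
    ego_probs:  (h, w, C) per-cell class probabilities, agent at the centre cell
                (assumed layout; all-zero cells are treated as unobserved).
    pose:       (x, y, theta) agent position in global cells and heading in radians.
    """
    h, w, _ = ego_probs.shape
    x, y, theta = pose
    rows, cols = np.mgrid[0:h, 0:w]
    dx = cols - w // 2  # offset from the agent along ego columns
    dy = rows - h // 2  # offset from the agent along ego rows
    # Rigid 2D transform from the egocentric frame to the global frame.
    gx = np.rint(x + np.cos(theta) * dx - np.sin(theta) * dy).astype(int)
    gy = np.rint(y + np.sin(theta) * dx + np.cos(theta) * dy).astype(int)
    H, W, _ = global_acc.shape
    inside = (gx >= 0) & (gx < W) & (gy >= 0) & (gy < H)
    # np.add.at sums correctly even if several ego cells land on one global cell.
    np.add.at(global_acc, (gy[inside], gx[inside]), ego_probs[inside])
    return global_acc

def label_distribution(global_acc):
    """Normalise accumulated mass into a per-cell distribution over region labels.

    Cells that were never observed fall back to a uniform distribution.
    """
    mass = global_acc.sum(axis=-1, keepdims=True)
    uniform = np.full_like(global_acc, 1.0 / global_acc.shape[-1])
    return np.where(mass > 0, global_acc / np.maximum(mass, 1e-12), uniform)

# Hypothetical example: a 5x5 global map with 3 region labels and one
# 3x3 egocentric observation taken at the map centre with zero heading.
global_acc = np.zeros((5, 5, 3))
ego = np.tile(np.array([0.7, 0.2, 0.1]), (3, 3, 1))
update_global_map(global_acc, ego, pose=(2, 2, 0.0))
dist = label_distribution(global_acc)
```

Averaging accumulated probabilities is only one simple fusion rule, chosen here for brevity. The method in the paper instead conditions a learned neural mapper on features from the region classifier.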

Paper Structure (32 sections, 3 equations, 4 figures, 4 tables)

Figures (4)

  • Figure 1: We address the task of indoor semantic region mapping for Embodied Navigation, which requires an agent to build a global semantic understanding of the large-scale regions in an environment. Our approach conditions a learned neural mapper with visual features extracted using a region classifier to produce a geometric region map of the environment.
  • Figure 2: Sample observations and corresponding top-down semantic maps from the extracted dataset. The upper row shows a common challenge faced in the dataset: occlusions and multiple semantic categories in a single image. Furthermore, on the right side of the upper image, the region is not obviously recognizable as a bathroom.
  • Figure 3: Architecture of the proposed Semantic Region Mapper.
  • Figure 4: Samples of generated maps with corresponding ground truth and visual inputs on the environments of Matterport3D. Note that the first row shows that the model defines the region between the TV and the couches as a hallway (in red).