
SemanticSLAM: Learning based Semantic Map Construction and Robust Camera Localization

Mingyang Li, Yue Ma, Qinru Qiu

TL;DR

SemanticSLAM addresses the memory and computation burden of traditional VSLAM by learning a semantic, neural-symbolic representation of the environment and performing visual-inertial localization at a reduced image frequency. The method extracts semantic features from RGB-D data, projects them into an egocentric observation map, and maintains an allocentric semantic map that is updated by a ConvLSTM. Pose estimates are produced by fusing visual cues with IMU data; the system outputs a pose distribution $p_t$ over a discrete grid of poses and an $L$-channel semantic map $m_t$. Training minimizes a KL-divergence loss over the map sequence to refine map quality, so that localization and mapping improve together over time; an ROI-based update strategy helps mitigate drift and observation noise. Experiments on IndoorScenes show that SemanticSLAM outperforms baselines in both accuracy and generalization, yielding interpretable semantic maps that support downstream tasks such as navigation and obstacle avoidance, with potential benefits for multi-robot sharing.
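To make the two outputs named above more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that $p_t$ is obtained by a softmax over a hypothetical `(X, Y, Theta)` grid of matching scores and that the KL-divergence loss compares per-cell distributions across the $L$ channels of a target map and a predicted map. The function names, the grid layout, and the per-cell normalization are all assumptions made for illustration.

```python
import numpy as np

def pose_distribution(scores):
    """Turn a grid of pose scores (assumed shape X x Y x Theta) into p_t via softmax."""
    flat = scores.reshape(-1)
    flat = flat - flat.max()              # subtract the max for numerical stability
    probs = np.exp(flat)
    probs /= probs.sum()
    return probs.reshape(scores.shape)

def map_kl_loss(target_map, predicted_map, eps=1e-8):
    """Mean per-cell KL(target || predicted) over the L semantic channels.

    Both maps have shape (L, H, W). Normalizing each cell across L is an
    assumption of this sketch, not a detail taken from the paper.
    """
    t = np.clip(target_map, eps, None)
    p = np.clip(predicted_map, eps, None)
    t = t / t.sum(axis=0, keepdims=True)
    p = p / p.sum(axis=0, keepdims=True)
    kl = (t * (np.log(t) - np.log(p))).sum(axis=0)   # shape (H, W)
    return float(kl.mean())

def sequence_loss(target_maps, predicted_maps):
    """Average the map loss over a sequence of time steps."""
    losses = [map_kl_loss(t, p) for t, p in zip(target_maps, predicted_maps)]
    return sum(losses) / len(losses)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    p_t = pose_distribution(rng.normal(size=(5, 5, 8)))
    best_pose = np.unravel_index(np.argmax(p_t), p_t.shape)
    print("most likely grid pose (x, y, theta index):", best_pose)
```

Keeping $p_t$ as a full distribution rather than a single pose lets downstream steps weigh several candidate poses; the `argmax` in the usage example is only one way to read it out.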

Abstract

Current techniques in Visual Simultaneous Localization and Mapping (VSLAM) estimate camera displacement by comparing image features of consecutive scenes. These algorithms depend on scene continuity and therefore require frequent camera inputs. However, processing images frequently can lead to significant memory usage and computation overhead. In this study, we introduce SemanticSLAM, an end-to-end visual-inertial odometry system that utilizes semantic features extracted from an RGB-D sensor. This approach enables the creation of a semantic map of the environment and ensures reliable camera localization. SemanticSLAM is scene-agnostic, which means it doesn't require retraining for different environments. It operates effectively in indoor settings, even with infrequent camera input, without prior knowledge. The strength of SemanticSLAM lies in its ability to gradually refine the semantic map and improve pose estimation. This is achieved by a convolutional long short-term memory (ConvLSTM) network, trained to correct errors during map construction. Compared to existing VSLAM algorithms, SemanticSLAM improves pose estimation by 17%. The resulting semantic map provides interpretable information about the environment and can be easily applied to various downstream tasks, such as path planning, obstacle avoidance, and robot navigation. The code will be publicly available at https://github.com/Leomingyangli/SemanticSLAM
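As a rough illustration of how semantic features from an RGB-D sensor could become a top-down map, the sketch below bins labelled depth pixels into a grid centered on the camera. It is an assumption-laden example rather than the paper's pipeline: it uses a simple pinhole model, ignores pixel height, and takes per-pixel class labels as given. The function `egocentric_observation` and the parameters `fx`, `cx`, `grid_size`, and `cell_m` are hypothetical names introduced here.

```python
import numpy as np

def egocentric_observation(depth, labels, fx, cx, num_classes,
                           grid_size=32, cell_m=0.25):
    """Bin labelled depth pixels into a (num_classes, grid_size, grid_size) count map.

    depth  : (H, W) array of forward distances in metres (0 marks invalid pixels)
    labels : (H, W) integer class ids in [0, num_classes)
    Row 0 of the grid is the cell nearest the camera; the camera sits at the
    centre column. All of this layout is an assumption of the sketch.
    """
    h, w = depth.shape
    u = np.tile(np.arange(w), (h, 1))                # pixel column index per pixel
    lateral = (u - cx) * depth / fx                  # pinhole model: sideways offset in metres
    col = np.floor(lateral / cell_m).astype(int) + grid_size // 2
    row = np.floor(depth / cell_m).astype(int)       # forward distance sets the grid row
    valid = ((depth > 0) & (col >= 0) & (col < grid_size)
             & (row >= 0) & (row < grid_size))
    obs = np.zeros((num_classes, grid_size, grid_size))
    # np.add.at accumulates correctly when several pixels fall in the same cell
    np.add.at(obs, (labels[valid], row[valid], col[valid]), 1.0)
    return obs

if __name__ == "__main__":
    depth = np.full((4, 6), 2.0)                     # a flat wall 2 m ahead
    labels = np.ones((4, 6), dtype=int)              # every pixel labelled class 1
    obs = egocentric_observation(depth, labels, fx=6.0, cx=2.5, num_classes=3)
    print("occupied cells for class 1:", int((obs[1] > 0).sum()))
```

A count map like this would still need normalization and alignment with the allocentric map before any update step; those stages are outside the scope of this sketch.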

Paper Structure

This paper contains 16 sections, 10 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: System Overview of SemanticSLAM
  • Figure 2: IndoorScenes Dataset
  • Figure 3: Localization performance over time. Our model (green solid) vs. reference models (dashed lines)
  • Figure 4: Map construction loss over time
  • Figure 5: Visualization of the Map Update Process. From left to right: Original global map $m_{t-1}$, Semantic observation $o_{t}$, Updated global map $m_{t}$, and Ground-truth.