
Towards Localizing Structural Elements: Merging Geometrical Detection with Semantic Verification in RGB-D Data

Ali Tourani, Saad Ejaz, Hriday Bavle, Jose Luis Sanchez-Lopez, Holger Voos

TL;DR

This work addresses localizing building components in RGB-D data by merging geometric plane detection with semantic verification. It presents a real-time, parallel pipeline that first extracts 3D planes from depth data via down-sampling and RANSAC, then validates their semantic labels using panoptic segmentation, and finally fuses the two modalities. By integrating the pipeline into ORB-SLAM 3.0, the authors demonstrate improved map reconstruction and show that the resulting environment-driven constraints enhance 3D scene graphs. Experiments on ICL, ScanNet, and in-house data show high component recognition accuracy and notable RMSE improvements in several sequences, while highlighting challenges with curved walls and occlusion.
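
The geometric stage summarized above (down-sampling followed by RANSAC) can be illustrated with a minimal, single-threaded NumPy sketch. It assumes the depth image has already been back-projected into an (N, 3) point cloud in metres. The function names `voxel_downsample`, `ransac_plane`, and `extract_planes`, as well as the voxel size, distance threshold, and iteration count, are hypothetical choices for illustration, not the authors' implementation or parameters. Each plane is returned as coefficients (a, b, c, d) with a unit normal n = (a, b, c), so a point x lies on it when n·x + d = 0.

```python
import numpy as np


def voxel_downsample(points: np.ndarray, voxel: float) -> np.ndarray:
    """Replace all points that fall into the same voxel by their centroid."""
    if len(points) == 0:
        return points
    keys = np.floor(points / voxel).astype(np.int64)
    _, inverse = np.unique(keys, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)  # keep a flat index array across NumPy versions
    n = inverse.max() + 1
    sums = np.zeros((n, 3))
    np.add.at(sums, inverse, points)  # accumulate coordinates per voxel
    counts = np.bincount(inverse, minlength=n)
    return sums / counts[:, None]


def ransac_plane(points, threshold=0.02, iterations=200, seed=0):
    """Fit one plane (a, b, c, d) with RANSAC; return it with its inlier mask."""
    rng = np.random.default_rng(seed)
    best_inliers = np.zeros(len(points), dtype=bool)
    best_plane = None
    for _ in range(iterations):
        sample = points[rng.choice(len(points), 3, replace=False)]
        normal = np.cross(sample[1] - sample[0], sample[2] - sample[0])
        norm = np.linalg.norm(normal)
        if norm < 1e-9:  # skip degenerate (collinear) samples
            continue
        normal = normal / norm
        d = -float(normal @ sample[0])
        inliers = np.abs(points @ normal + d) < threshold  # point-to-plane distance
        if inliers.sum() > best_inliers.sum():
            best_inliers, best_plane = inliers, np.append(normal, d)
    return best_plane, best_inliers


def extract_planes(points, min_inliers=500, max_planes=10, **kw):
    """Sequential RANSAC: find the largest plane, remove its inliers, repeat."""
    planes = []
    remaining = points
    while len(remaining) >= max(min_inliers, 3) and len(planes) < max_planes:
        plane, inliers = ransac_plane(remaining, **kw)
        if plane is None or inliers.sum() < min_inliers:
            break
        planes.append((plane, remaining[inliers]))
        remaining = remaining[~inliers]
    return planes


if __name__ == "__main__":
    # Synthetic scene: a floor at z = 0 and a wall at x = 2.
    rng = np.random.default_rng(1)
    floor = np.c_[rng.uniform(0, 4, 2000), rng.uniform(0, 4, 2000), np.zeros(2000)]
    wall = np.c_[np.full(1500, 2.0), rng.uniform(0, 4, 1500), rng.uniform(0.1, 2.5, 1500)]
    cloud = voxel_downsample(np.vstack([floor, wall]), voxel=0.05)
    for plane, inlier_points in extract_planes(cloud):
        print(np.round(plane, 3), len(inlier_points))
```

Removing each plane's inliers before searching again (sequential RANSAC) is one common way to recover several planes from a single frame. Open3D's `PointCloud.voxel_down_sample` and `PointCloud.segment_plane` offer library versions of the two building blocks.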

Abstract

RGB-D cameras supply rich and dense visual and spatial information for various robotics tasks such as scene understanding, map reconstruction, and localization. Integrating depth and visual information can aid robots in localization and element mapping, advancing applications like 3D scene graph generation and Visual Simultaneous Localization and Mapping (VSLAM). While point cloud data containing such information is primarily used for enhanced scene understanding, its potential to capture and represent rich semantic information has yet to be adequately exploited. This paper presents a real-time pipeline for localizing building components, including wall and ground surfaces, by combining purely geometric 3D plane detection with subsequent validation of each plane's semantic category, using point cloud data from RGB-D cameras. It uses a parallel, multi-threaded architecture to precisely estimate the poses and equations of all planes detected in the environment, filters those forming the map structure through panoptic segmentation-based validation, and keeps only the validated building components. Incorporating the proposed method into a VSLAM framework confirmed that constraining the map with the detected environment-driven semantic elements can improve scene understanding and map reconstruction accuracy. It can also ensure (re-)association of these detected components into a unified 3D scene graph, bridging the gap between geometric accuracy and semantic understanding. Additionally, the pipeline allows for the detection of potential higher-level structural entities, such as rooms, by identifying the relationships between building components based on their layout.
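
The semantic-validation and layout steps described in the abstract can be sketched as follows. This is a hypothetical illustration, not the paper's method. It assumes that each detected plane already carries the panoptic labels of its inlier points (for example, by looking up every back-projected pixel in the segmentation mask), that the z axis is aligned with gravity, and that a room can be approximated by two perpendicular pairs of parallel, well-separated walls. The class names, thresholds, and function names are all placeholders.

```python
from collections import Counter

import numpy as np

# Hypothetical class names; a real panoptic model defines its own label set.
WALL, GROUND = "wall", "floor"


def validate_plane(labels, min_ratio=0.6):
    """Majority vote over a plane's per-point panoptic labels (assumed input)."""
    if len(labels) == 0:
        return None
    label, count = Counter(labels).most_common(1)[0]
    if label in (WALL, GROUND) and count / len(labels) >= min_ratio:
        return label
    return None


def geometric_class(normal, up=np.array([0.0, 0.0, 1.0]), tol_deg=15.0):
    """Classify by orientation, assuming `up` is the gravity-aligned axis."""
    cos = abs(float(normal @ up)) / np.linalg.norm(normal)
    if cos >= np.cos(np.radians(tol_deg)):
        return GROUND  # normal nearly parallel to gravity: horizontal surface
    if cos <= np.sin(np.radians(tol_deg)):
        return WALL  # normal nearly perpendicular to gravity: vertical surface
    return None


def fuse_plane(plane, labels):
    """Keep a plane only if its semantic vote agrees with its geometry."""
    semantic = validate_plane(labels)
    return semantic if semantic is not None and semantic == geometric_class(plane[:3]) else None


def parallel_pairs(walls, tol_deg=10.0, min_gap=1.0):
    """Index pairs of near-parallel walls (a, b, c, d) at least `min_gap` apart."""
    pairs = []
    for i in range(len(walls)):
        for j in range(i + 1, len(walls)):
            dot = float(walls[i][:3] @ walls[j][:3])
            if abs(dot) < np.cos(np.radians(tol_deg)):
                continue
            # Flip wall j's offset if its normal points the opposite way.
            gap = abs(walls[i][3] - np.sign(dot) * walls[j][3])
            if gap >= min_gap:
                pairs.append((i, j))
    return pairs


def room_candidates(walls, tol_deg=10.0, min_gap=1.0):
    """Hypothetical room heuristic: two mutually perpendicular wall pairs."""
    pairs = parallel_pairs(walls, tol_deg, min_gap)
    rooms = []
    for a in range(len(pairs)):
        for b in range(a + 1, len(pairs)):
            n_a, n_b = walls[pairs[a][0]][:3], walls[pairs[b][0]][:3]
            if abs(float(n_a @ n_b)) <= np.sin(np.radians(tol_deg)):
                rooms.append((pairs[a], pairs[b]))
    return rooms
```

Requiring both the semantic vote and the plane orientation to agree is one simple way to reject cases such as a flat cupboard door labeled as a wall with a non-vertical fit; the room heuristic only proposes candidates and would still need the kind of association logic a scene-graph backend provides.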

Paper Structure

This paper contains 11 sections, 3 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: The 3D reconstructed map of an indoor environment built with the proposed approach, shown (left) with building components color-coded for clear distinction and (right) with the original RGB textures for a realistic representation.
  • Figure 2: The order of steps to detect building components from RGB-D data using the proposed approach.
  • Figure 3: The in-house data collection setup (A) and some instances of the collected dataset (B).
  • Figure 4: Reconstructed maps presented in 3D scene graphs enriched with recognized building components in some dataset instances. Labels have been manually added to enhance clarity.
  • Figure 5: Edge-case scenarios that challenge the proposed method: a) multiple walls detected because of a wall's curved shape, b) parts of a cupboard misclassified as a wall, c) glass blocks not recognized as walls, and d) the same walls not being associated due to pose error.