vS-Graphs: Tightly Coupling Visual SLAM and 3D Scene Graphs Exploiting Hierarchical Scene Understanding

Ali Tourani, Saad Ejaz, Hriday Bavle, Miguel Fernandez-Cortizas, David Morilla-Cabello, Jose Luis Sanchez-Lopez, Holger Voos

TL;DR

vS-Graphs is a real-time VSLAM framework that tightly integrates vision-based scene understanding with optimizable 3D scene graphs. By extracting building components (walls and ground surfaces) from RGB-D data and inferring higher-level structural elements (rooms and floors) from them, the system produces semantically rich maps and more accurate pose estimates. The approach extends ORB-SLAM 3.0 with semantic threads and stores the detected entities in an Atlas for joint geometric-semantic optimization. It achieves state-of-the-art trajectory and mapping performance on standard benchmarks and an in-house dataset while running in real time (~22 FPS). The results indicate that visual features alone can rival LiDAR-based scene graphs in semantic scene understanding, which makes the approach relevant to semantically aware navigation and mapping in indoor environments.

Abstract

Current Visual Simultaneous Localization and Mapping (VSLAM) systems often struggle to create maps that are both semantically rich and easily interpretable. Incorporating semantic scene knowledge helps build richer maps with contextual associations among mapped objects, but representing that knowledge in structured formats such as scene graphs has not been widely addressed, which makes maps harder to interpret and limits scalability. This paper introduces vS-Graphs, a novel real-time VSLAM framework that integrates vision-based scene understanding with map reconstruction and a comprehensible graph-based representation. The framework infers structural elements (i.e., rooms and floors) from detected building components (i.e., walls and ground surfaces) and incorporates them into optimizable 3D scene graphs. This improves the reconstructed map's semantic richness and comprehensibility as well as localization accuracy. Extensive experiments on standard benchmarks and real-world datasets demonstrate that vS-Graphs achieves an average accuracy gain of 15.22% across all tested datasets compared to state-of-the-art VSLAM methods. Furthermore, the proposed framework achieves environment-driven semantic entity detection accuracy comparable to that of precise LiDAR-based frameworks, using only visual features. The code is publicly available at https://github.com/snt-arg/visual_sgraphs and is actively being improved. A web page with additional media and evaluation results is available at https://snt-arg.github.io/vsgraphs-results/.
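
To make the hierarchy described in the abstract concrete, the following is a minimal Python sketch of a scene graph in which floors contain rooms and rooms reference wall planes. It is an illustration only, not the data model used in the vS-Graphs code: every class, field, and method name here (Plane, Room, Floor, SceneGraph, walls_of_floor) is hypothetical, planes are assumed to be stored as a unit normal plus an offset, and the detection of planes and the graph optimization are not modeled.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Plane:
    """A building component (wall or ground), stored as n . x + d = 0."""
    plane_id: int
    kind: str  # "wall" or "ground" (hypothetical labels)
    normal: Tuple[float, float, float]
    offset: float


@dataclass
class Room:
    """A structural element that groups the walls enclosing it."""
    room_id: int
    wall_ids: List[int] = field(default_factory=list)


@dataclass
class Floor:
    """A structural element that groups rooms sharing one ground plane."""
    floor_id: int
    ground_id: int
    room_ids: List[int] = field(default_factory=list)


@dataclass
class SceneGraph:
    planes: Dict[int, Plane] = field(default_factory=dict)
    rooms: Dict[int, Room] = field(default_factory=dict)
    floors: Dict[int, Floor] = field(default_factory=dict)

    def add_plane(self, plane: Plane) -> None:
        self.planes[plane.plane_id] = plane

    def add_room(self, room_id: int, wall_ids: List[int]) -> None:
        # A room may only reference planes that exist and are walls.
        for wid in wall_ids:
            if wid not in self.planes or self.planes[wid].kind != "wall":
                raise ValueError(f"plane {wid} is not a known wall")
        self.rooms[room_id] = Room(room_id, list(wall_ids))

    def add_floor(self, floor_id: int, ground_id: int, room_ids: List[int]) -> None:
        # A floor needs a known ground plane and known rooms.
        if ground_id not in self.planes or self.planes[ground_id].kind != "ground":
            raise ValueError(f"plane {ground_id} is not a known ground plane")
        for rid in room_ids:
            if rid not in self.rooms:
                raise ValueError(f"room {rid} is unknown")
        self.floors[floor_id] = Floor(floor_id, ground_id, list(room_ids))

    def walls_of_floor(self, floor_id: int) -> List[int]:
        """Walk floor -> rooms -> walls and return the sorted wall ids."""
        wall_ids = set()
        for rid in self.floors[floor_id].room_ids:
            wall_ids.update(self.rooms[rid].wall_ids)
        return sorted(wall_ids)


# Usage: one floor with a single four-walled room (illustrative values).
graph = SceneGraph()
graph.add_plane(Plane(0, "ground", (0.0, 0.0, 1.0), 0.0))
wall_normals = [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, -1.0, 0.0)]
for i, normal in enumerate(wall_normals, start=1):
    graph.add_plane(Plane(i, "wall", normal, 2.0))
graph.add_room(0, [1, 2, 3, 4])
graph.add_floor(0, ground_id=0, room_ids=[0])
print(graph.walls_of_floor(0))
```

Keeping each level as a dictionary keyed by id means a higher-level node only stores references to its children, so a plane whose parameters are refined later does not need to be copied into every room or floor that uses it.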

Paper Structure

This paper contains 15 sections, 18 equations, 10 figures, 5 tables.

Figures (10)

  • Figure 1: A reconstructed map aligned with the optimizable 3D scene graph generated by the proposed vS-Graphs, enriched with environment-driven semantic entities. Point clouds in distinct colors represent different building components (sequence MR03 of the AutoSense dataset).
  • Figure 2: The multi-thread architecture of vS-Graphs. Modules with dashed borders and a light gray background are inherited directly from the baseline (i.e., ORB-SLAM 3.0), while the remaining components are newly added or modified modules.
  • Figure 3: Scene graph structure generated using vS-Graphs, creating a hierarchical representation of the environment.
  • Figure 4: In-house dataset collection using the AutoSense device: a) the setup overview, b) the device mounted on a legged robot, and c) some instances of the collected data.
  • Figure 5: Mapping performance across eight iterations, showing AutoSense sequences with an RMSE below one meter.
  • ...and 5 more figures