SG-Reg: Generalizable and Efficient Scene Graph Registration

Chuhao Liu, Zhijian Qiao, Jieqi Shi, Ke Wang, Peize Liu, Shaojie Shen

TL;DR

SG-Reg addresses global registration of semantic scene graphs under real-world noise and cross-domain shifts. It encodes multi-modal node features (semantic label, local topology via a triplet descriptor, and shape) and performs coarse-to-fine matching, followed by a robust pose estimator based on G3Reg that estimates the 4-DoF transformation between graphs. Its self-supervised data generation via FM-Fusion reduces reliance on ground-truth semantic annotations and enhances cross-domain generalization, achieving higher registration recall and lower bandwidth than image- and point-cloud-based baselines across 3RScan, ScanNet, and two-agent SLAM benchmarks. The approach delivers a sparse, efficient representation and robust registration in challenging indoor environments, enabling practical multi-agent semantic SLAM with limited communication.

Abstract

This paper addresses the challenges of registering two rigid semantic scene graphs, an essential capability when an autonomous agent needs to register its map against that of a remote agent, or against a prior map. The hand-crafted descriptors in classical semantic-aided registration, and the reliance on ground-truth annotations in learning-based scene graph registration, impede their application in practical real-world environments. To address these challenges, we design a scene graph network to encode multiple modalities of semantic nodes: open-set semantic feature, local topology with spatial awareness, and shape feature. These modalities are fused to create compact semantic node features. The matching layers then search for correspondences in a coarse-to-fine manner. In the back-end, we employ a robust pose estimator to determine the transformation from the correspondences. We maintain a sparse and hierarchical scene representation. Our approach demands fewer GPU resources and less communication bandwidth in multi-agent tasks. Moreover, we design a new data generation approach using vision foundation models and a semantic mapping module to reconstruct semantic scene graphs. It differs significantly from previous works, which rely on ground-truth semantic annotations to generate data. We validate our method in a two-agent SLAM benchmark. It significantly outperforms the hand-crafted baseline in terms of registration success rate. Compared to visual loop closure networks, our method achieves a slightly higher registration recall while requiring only 52 KB of communication bandwidth for each query frame. Code available at: http://github.com/HKUST-Aerial-Robotics/SG-Reg.
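
As a rough illustration of the back-end step described above, the sketch below estimates a 4-DoF transformation from already-matched node centroids. It is not G3Reg and not the authors' implementation: it is a plain closed-form least-squares fit with no outlier rejection, and it assumes that "4-DoF" means a yaw rotation about a gravity-aligned z axis plus a 3D translation. The function names `estimate_4dof` and `apply_4dof` and the sample points are hypothetical.

```python
import math


def apply_4dof(yaw, t, p):
    """Rotate point p about the z axis by yaw, then translate by t."""
    c, s = math.cos(yaw), math.sin(yaw)
    return (c * p[0] - s * p[1] + t[0],
            s * p[0] + c * p[1] + t[1],
            p[2] + t[2])


def estimate_4dof(src, dst):
    """Least-squares yaw and translation that map src points onto dst points.

    src and dst are equal-length sequences of (x, y, z) tuples, where
    src[i] is assumed to correspond to dst[i] (e.g. matched node centroids).
    """
    n = len(src)
    if n != len(dst) or n < 2:
        raise ValueError("need at least two paired points")

    # Centroids of both point sets.
    src_mean = [sum(p[k] for p in src) / n for k in range(3)]
    dst_mean = [sum(q[k] for q in dst) / n for k in range(3)]

    # Accumulate the dot and cross terms of the centered xy coordinates.
    a = b = 0.0
    for p, q in zip(src, dst):
        px, py = p[0] - src_mean[0], p[1] - src_mean[1]
        qx, qy = q[0] - dst_mean[0], q[1] - dst_mean[1]
        a += px * qx + py * qy
        b += px * qy - py * qx

    # The yaw that maximizes a*cos(yaw) + b*sin(yaw).
    yaw = math.atan2(b, a)

    # Translation aligns the rotated source centroid with the target centroid.
    c, s = math.cos(yaw), math.sin(yaw)
    t = (dst_mean[0] - (c * src_mean[0] - s * src_mean[1]),
         dst_mean[1] - (s * src_mean[0] + c * src_mean[1]),
         dst_mean[2] - src_mean[2])
    return yaw, t


# Hypothetical centroids in agent A's frame and their noise-free images in B's frame.
src_points = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.5), (0.0, 2.0, 1.0), (3.0, 1.0, 0.2)]
true_yaw, true_t = 0.7, (1.0, -2.0, 0.5)
dst_points = [apply_4dof(true_yaw, true_t, p) for p in src_points]

est_yaw, est_t = estimate_4dof(src_points, dst_points)
```

With noise-free correspondences this fit recovers the yaw and translation used to generate the data. A single wrong match can bias it arbitrarily, which is why a robust estimator is needed on real correspondences.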

Paper Structure

This paper contains 71 sections, 22 equations, 17 figures, 18 tables.

Figures (17)

  • Figure 1: Registering the semantic scene graphs in the two-agent SLAM system. (a) Captured RGB-D sequences from the two agents in a real-world indoor scene. The two agents move in opposite directions, creating a large viewpoint difference between their cameras. (b) Visualization of the matched nodes between the semantic scene graphs from the two agents. The scene graphs are constructed using FM-Fusion [liu2024fmfusion]. The zoomed subvolume showcases examples of inconsistent semantic nodes. For better visualization, only a subset of the semantic labels is displayed. (c) Registration result.
  • Figure 2: Our system overview. We denote the encoded node features as ${}^l\mathbf{X}^{A/B}$, where its layer index $l \in \{0,1,2\}$.
  • Figure 3: Visualization of a semantic scene graph from ScanNet scene0025_00. Each node's point cloud is distinctly colored. For node $\mathbf{v}_i$, we illustrate one of its triplets. Additionally, the implicit features derived from $\mathbf{v}_i$ are displayed.
  • Figure 4: Visualization of the shape network structure and its point aggregation kernels. The point backbone uses grid sub-sampling to decide aggregation kernels, which are small and dense. The shape backbone follows instance segmentation to create aggregation kernels, which are large and sparse.
  • Figure 5: Two-agent SLAM system structure. Modules marked with * run offline.
  • ...and 12 more figures