SGAligner++: Cross-Modal Language-Aided 3D Scene Graph Alignment

Binod Singh, Sayan Deb Sarkar, Iro Armeni

TL;DR

SGAligner++, a cross-modal, language-aided framework for 3D scene graph alignment, addresses the challenge of aligning partially overlapping scene observations across heterogeneous modalities by learning a unified joint embedding space, enabling accurate alignment even under low-overlap conditions and sensor noise.

Abstract

Aligning 3D scene graphs is a crucial initial step for several applications in robot navigation and embodied perception. Current methods in 3D scene graph alignment often rely on single-modality point cloud data and struggle with incomplete or noisy input. We introduce SGAligner++, a cross-modal, language-aided framework for 3D scene graph alignment. Our method addresses the challenge of aligning partially overlapping scene observations across heterogeneous modalities by learning a unified joint embedding space, enabling accurate alignment even under low-overlap conditions and sensor noise. By employing lightweight unimodal encoders and attention-based fusion, SGAligner++ enhances scene understanding for tasks such as visual localization, 3D reconstruction, and navigation, while ensuring scalability and minimal computational overhead. Extensive evaluations on real-world datasets demonstrate that SGAligner++ outperforms state-of-the-art methods by up to 40% on noisy real-world reconstructions, while enabling cross-modal generalization.
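
The attention-based fusion of unimodal embeddings mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the use of one scalar attention logit per modality, and the handling of missing modalities are all assumptions made for illustration. It shows one plausible way to combine per-modality node embeddings (for example point cloud, CAD mesh, text caption, spatial referral) into a single vector in a joint space.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def l2_normalize(vec):
    """Scale a vector to unit length; zero vectors are returned unchanged."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm > 0 else list(vec)


def fuse_modalities(embeddings, logits):
    """Hypothetical attention-weighted fusion of per-modality embeddings.

    embeddings: dict mapping modality name -> vector, or None if missing.
    logits:     dict mapping modality name -> scalar attention logit
                (assumed to be learned during training).
    Returns a unit-length fused vector.
    """
    present = [name for name, emb in embeddings.items() if emb is not None]
    if not present:
        raise ValueError("at least one modality is required")

    # Attention weights are computed only over the modalities that exist,
    # so a node without, say, a CAD mesh can still be embedded.
    weights = softmax([logits[name] for name in present])

    dim = len(embeddings[present[0]])
    fused = [0.0] * dim
    for weight, name in zip(weights, present):
        unit = l2_normalize(embeddings[name])
        for i in range(dim):
            fused[i] += weight * unit[i]
    return l2_normalize(fused)


node = {"point": [1.0, 0.0], "text": [0.0, 1.0], "cad": None}
attention_logits = {"point": 0.0, "text": 0.0, "cad": 0.0}
joint_embedding = fuse_modalities(node, attention_logits)
```

With equal logits, the two available modalities contribute equally and the missing CAD embedding is simply skipped. In a trained model the logits would be learned so that more reliable modalities receive larger weights.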

Paper Structure

This paper contains 13 sections, 3 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: SGAligner++. We address the problem of aligning 3D scene graphs across different modalities, namely, point clouds, CAD meshes, text captions, and spatial referrals, using a joint embedding space. Our approach creates a unified 3D scene graph, ensuring that spatial relationships are accurately preserved. It enables robust 3D scene understanding for visual localization and robot navigation.
  • Figure 2: Overview of SGAligner++. Our method takes as input: (a) two scene point clouds with spatially overlapping objects, and (b) their corresponding 3D scene graphs with multi-modal information: point clouds, CAD meshes, text captions, and spatial referrals. (c) We process the data via separate uni-modal encoders and optimize them together in a joint embedding space using trainable attention. (d) Similar nodes are aligned together in the common space and we finally output a unified 3D scene graph, which preserves spatial-semantic consistency and enables multiple downstream tasks.
  • Figure 3: Example of context-aware LLM-generated scene graphs. Overlapping pairs are in green and non-overlapping are in red.
  • Figure 4: Qualitative Results on Node Matching. Given two partially overlapping observations of the same scene, EVA [eva] fails to identify any correct matches and instead aligns objects within the same scene, while SGAligner [sarkar2023sgaligner] cannot handle intra-class instances (e.g., two chairs). In contrast, SGAligner++ correctly identifies all point cloud matches, as well as CAD ones. Numbers indicate common objects across the two overlapping scenes.
  • Figure 5: Node Matching Mean RR vs. Overlap Range, on 3RScan. SGAligner++ generalizes across overlap thresholds and performs robustly even in low-overlap cases.
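
The node matching metric in Figure 5 can be made concrete with a short sketch, assuming "Mean RR" denotes mean reciprocal rank. The function names and the toy similarity scores below are hypothetical and not taken from the paper. For each node in one scene graph, candidates in the other graph are ranked by similarity in the joint embedding space, and the reciprocal of the rank of the true match is averaged over all nodes.

```python
def reciprocal_rank(similarities, true_index):
    """Reciprocal rank of the true match among all candidates.

    The rank is 1 plus the number of candidates that score strictly
    higher than the true match.
    """
    true_score = similarities[true_index]
    rank = 1 + sum(1 for score in similarities if score > true_score)
    return 1.0 / rank


def mean_reciprocal_rank(similarity_rows, true_indices):
    """Average reciprocal rank over all query nodes."""
    ranks = [
        reciprocal_rank(row, true_index)
        for row, true_index in zip(similarity_rows, true_indices)
    ]
    return sum(ranks) / len(ranks)


# Toy example: 3 query nodes, 3 candidate nodes each (made-up scores).
similarity_rows = [
    [0.9, 0.1, 0.3],  # true match is candidate 0, ranked 1st
    [0.2, 0.4, 0.8],  # true match is candidate 1, ranked 2nd
    [0.5, 0.6, 0.1],  # true match is candidate 0, ranked 2nd
]
true_indices = [0, 1, 0]
mrr = mean_reciprocal_rank(similarity_rows, true_indices)
```

A value of 1.0 would mean every true match is ranked first, so the metric rewards placing the correct node near the top even when it is not the top candidate.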