ESGNN: Towards Equivariant Scene Graph Neural Network for 3D Scene Understanding

Quang P. M. Pham, Khoi T. N. Nguyen, Lan C. Ngo, Truong Do, Truong Son Hy

TL;DR

The paper tackles robust 3D scene graph generation from point clouds by preserving geometric symmetry through an Equivariant Graph Neural Network (ESGNN). ESGNN combines FAN-GCL attention with EGCL-based equivariant message passing to maintain $E(n)$-equivariance, enabling stable representations under rotations and translations with fewer layers and lower computational costs. Using PointNet-based segment encoders and a neighbor graph, ESGNN achieves faster convergence and improved relation prediction on 3DSSG/3RScan benchmarks, including unseen triplets, while remaining compatible with existing frameworks. This approach holds practical significance for real-time 3D scene understanding in robotics and computer vision, facilitating robust perception with efficient resource use and potential for image-guided extensions.
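To make the $E(n)$-equivariance claim concrete, the following is a minimal NumPy sketch of a generic equivariant graph convolution layer in the style of the standard EGNN formulation: messages depend only on node features and squared distances, and coordinates are updated along relative position vectors. It is not ESGNN's actual layer. The function names (`init_params`, `egcl`), the single-layer `tanh` maps, the random weights, and the fully connected graph are all assumptions made for illustration; the paper uses a neighbor graph between segments and combines the layer with FAN-GCL attention.

```python
import numpy as np


def init_params(feat_dim, hidden_dim, seed=0):
    """Random weights for the illustrative layer (hypothetical, untrained)."""
    rng = np.random.default_rng(seed)
    return {
        "W_e": rng.normal(scale=0.1, size=(2 * feat_dim + 1, hidden_dim)),
        "W_x": rng.normal(scale=0.1, size=(hidden_dim, 1)),
        "W_h": rng.normal(scale=0.1, size=(feat_dim + hidden_dim, feat_dim)),
    }


def egcl(h, x, params):
    """One EGNN-style layer on a fully connected graph (assumption).

    h: (n, f) node features, x: (n, d) node coordinates.
    Returns updated (h, x). Features are invariant and coordinates are
    equivariant under rotations, reflections, and translations of x.
    """
    n, f = h.shape
    diff = x[:, None, :] - x[None, :, :]                  # (n, n, d): x_i - x_j
    sq_dist = np.sum(diff ** 2, axis=-1, keepdims=True)   # (n, n, 1), invariant

    # Edge messages m_ij built from h_i, h_j and the invariant squared distance.
    h_i = np.broadcast_to(h[:, None, :], (n, n, f))
    h_j = np.broadcast_to(h[None, :, :], (n, n, f))
    m = np.tanh(np.concatenate([h_i, h_j, sq_dist], axis=-1) @ params["W_e"])
    m = m * (1.0 - np.eye(n))[:, :, None]                 # drop self-messages

    # Coordinate update: move along relative vectors, scaled by a scalar per edge.
    scale = m @ params["W_x"]                             # (n, n, 1)
    x_new = x + np.sum(diff * scale, axis=1) / max(n - 1, 1)

    # Feature update: aggregate messages, then apply a residual node update.
    m_i = m.sum(axis=1)                                   # (n, hidden)
    h_new = h + np.tanh(np.concatenate([h, m_i], axis=-1) @ params["W_h"])
    return h_new, x_new


def check_equivariance(n=5, feat_dim=4, hidden_dim=8, dim=3, seed=1):
    """Apply a random orthogonal map and translation before and after the layer."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(n, feat_dim))
    x = rng.normal(size=(n, dim))
    params = init_params(feat_dim, hidden_dim)
    Q, _ = np.linalg.qr(rng.normal(size=(dim, dim)))      # random orthogonal matrix
    t = rng.normal(size=(1, dim))

    h1, x1 = egcl(h, x, params)
    h2, x2 = egcl(h, x @ Q.T + t, params)
    return np.allclose(h1, h2), np.allclose(x1 @ Q.T + t, x2)
```

Calling `check_equivariance()` compares the layer's output on transformed inputs against the transformed output on the original inputs. Because the only geometric quantities entering the messages are squared distances, and coordinates change only along `x_i - x_j`, both comparisons should agree up to floating-point error.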

Abstract

Scene graphs have been proven to be useful for various scene understanding tasks due to their compact and explicit nature. However, existing approaches often neglect the importance of maintaining the symmetry-preserving property when generating scene graphs from 3D point clouds. This oversight can diminish the accuracy and robustness of the resulting scene graphs, especially when handling noisy, multi-view 3D data. This work, to the best of our knowledge, is the first to implement an Equivariant Graph Neural Network in semantic scene graph generation from 3D point clouds for scene understanding. Our proposed method, ESGNN, outperforms existing state-of-the-art approaches, demonstrating a significant improvement in scene estimation with faster convergence. ESGNN demands low computational resources and is easy to implement from available frameworks, paving the way for real-time applications such as robotics and computer vision.

Paper Structure

This paper contains 23 sections, 3 equations, 5 figures, and 3 tables.

Figures (5)

  • Figure 1: Overview of the proposed Equivariant Scene Graph framework. Our approach takes a sequence of point clouds (a) as input to generate a geometric segmentation (b). Subsequently, the properties of each segment and a neighbor graph between segments are constructed. The properties (d) and neighbor graph (e) of the segments that have been updated in the current frame (c) are used as the inputs to compute node and edge features (f) and to predict a 3D scene graph (g).
  • Figure 2: ESGNN Architecture.
  • Figure 3: Comparison of ESGNN with SGFN through the training steps.
  • Figure 4: Comparison of multiple ESGNN models with SGFN through the training steps.
  • Figure 5: Comparison of Joint-ESGNN, SGFN, JointSSG through the training steps.