ToLL: Topological Layout Learning with Structural Multi-view Augmentation for 3D Scene Graph Pretraining

Yucheng Huang, Luping Ji, Xiangwei Jiang, Wen Li, Mao Ye

Abstract

3D Scene Graph (3DSG) generation plays a pivotal role in spatial understanding and semantic-affordance perception, but its generalizability is often constrained by data scarcity. Existing solutions focus primarily on cross-modal assisted representation learning and object-centric generative pre-training. The former relies heavily on predicate annotations, while the latter's predicate learning may be bypassed due to strong object priors; consequently, neither provides a label-free, robust self-supervised proxy task for 3DSG fine-tuning. To bridge this gap, we propose Topological Layout Learning (ToLL), a framework for 3DSG pretraining. Specifically, we design Anchor-Conditioned Topological Geometry Reasoning, in which a GNN recovers the global layout of zero-centered subgraphs from the spatial priors of sparse anchors. This process is strictly modulated by predicate features, thereby enforcing predicate relation learning. Furthermore, we construct a Structural Multi-view Augmentation that avoids semantic corruption and enhances representations via self-distillation. Extensive experiments on the 3DSSG dataset demonstrate that ToLL improves representation quality, outperforming state-of-the-art baselines.
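The anchor-conditioned layout-recovery proxy task sketched in the abstract can be illustrated roughly as follows. This is a minimal NumPy sketch, not the paper's architecture: the toy graph, layer shapes, anchor choice, and masking scheme are all assumptions made for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy subgraph: 6 objects with 3D positions, zero-centered as in the paper's setup.
pos = rng.normal(size=(6, 3))
pos -= pos.mean(axis=0)            # zero-center the subgraph layout
anchor_idx = np.array([0, 3])      # hypothetical sparse anchors with known coords

# Hypothetical fully-connected object graph, row-normalized for mean aggregation.
A = np.ones((6, 6)) - np.eye(6)
A /= A.sum(axis=1, keepdims=True)

def gnn_layer(h, W):
    """One message-passing step: aggregate neighbor features, then linear + ReLU."""
    return np.maximum(A @ h @ W, 0.0)

# Input features: anchor coordinates revealed, all other nodes masked to zero.
x = np.zeros_like(pos)
x[anchor_idx] = pos[anchor_idx]

W1 = rng.normal(scale=0.1, size=(3, 16))
W2 = rng.normal(scale=0.1, size=(16, 3))
h = gnn_layer(x, W1)
pred = A @ h @ W2                  # predicted global layout for every node

# Label-free self-supervised objective: recover the zero-centered layout (MSE).
loss = np.mean((pred - pos) ** 2)
```

In the full method, the GNN messages would additionally be modulated by predicate features so that layout recovery cannot succeed without learning relations; that conditioning is omitted here for brevity.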

Paper Structure

This paper contains 48 sections, 35 equations, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Proxy task scheme comparison of 3DSG generation: (a) cross-modal Vision-Language representation, (b) trivial scene generation, (c) non-trivial scene generation, i.e., our ToLL.
  • Figure 2: Our 3D Scene Graph Pretraining scheme via Topological Layout Learning with Structural Multi-view Augmentation.
  • Figure 3: Predicate A@1 for all predicate categories.
  • Figure 4: Visualization of latent features clustering.
  • Figure 5: Visualization of latent features clustering.
  • ...and 4 more figures