3D Scene Graph Guided Vision-Language Pre-training

Hao Liu, Yanni Ma, Yan Liu, Haihong Xiao, Ying He

TL;DR

This work tackles 3D vision-language (VL) reasoning by proposing a unified pre-training framework guided by 3D scene graphs. It introduces scene graph-guided multi-level contrastive learning (SG_MCL) and masked modality modeling (MMM). SG_MCL aligns 3D objects with language at the word-object level, the sentence-referred object level, and the scene level; the model itself is built from a scene encoder, a text encoder, graph convolutions, and cross-attention. The pre-training objective L_pre combines the SG_MCL, MMM, detection, and language-to-object losses, and the pre-trained model is then fine-tuned for 3D visual grounding (VG), dense captioning (DC), and question answering (QA). Empirical results show performance comparable to or better than task-specific methods on these three tasks, supporting a simple, generalizable 3D VL pre-training paradigm anchored in 3D scene graphs.
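
As a reading aid, the combined objective named above can be written as a sum of its four components. The summary does not state how the terms are weighted, so the coefficients \lambda_i below are placeholders (an assumption, possibly all equal to 1 in practice), and the subscripts det and L2O are shorthand introduced here for the detection and language-to-object losses.

```latex
\mathcal{L}_{pre}
  = \lambda_1 \mathcal{L}_{SG\_MCL}
  + \lambda_2 \mathcal{L}_{MMM}
  + \lambda_3 \mathcal{L}_{det}
  + \lambda_4 \mathcal{L}_{L2O}
```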

Abstract

3D vision-language (VL) reasoning has gained significant attention due to its potential to bridge the 3D physical world with natural language descriptions. Existing approaches typically follow task-specific, highly specialized paradigms. As a result, these methods cover only a limited range of reasoning sub-tasks and rely heavily on hand-crafted modules and auxiliary losses. This highlights the need for a simpler, unified, and general-purpose model. In this paper, we leverage the inherent connection between 3D scene graphs and natural language, proposing a 3D scene graph-guided vision-language pre-training (VLP) framework. Our approach uses modality encoders, graph convolutional layers, and cross-attention layers to learn universal representations that adapt to a variety of 3D VL reasoning tasks, thereby eliminating the need for task-specific designs. The pre-training objectives include: 1) scene graph-guided contrastive learning, which leverages the strong correlation between 3D scene graphs and natural language to align 3D objects with textual features at multiple levels of granularity; and 2) masked modality learning, which uses cross-modality information to reconstruct masked words and 3D objects. Instead of directly reconstructing the 3D point clouds of masked objects, we use position clues to predict their semantic categories. Extensive experiments demonstrate that our pre-trained model, when fine-tuned on several downstream tasks, achieves performance comparable to or better than existing methods on 3D visual grounding, 3D dense captioning, and 3D question answering.
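
To make the contrastive objective more concrete, here is a minimal sketch of a generic symmetric InfoNCE loss between matched object and text features. It is not the paper's SG_MCL, which aligns features at three levels; it only illustrates the basic idea of pulling matched pairs together and pushing mismatched pairs apart. The function name `info_nce`, the assumption that row i of each array forms a positive pair, and the temperature of 0.07 are all illustrative choices, not taken from the source.

```python
import numpy as np


def info_nce(obj_feats, txt_feats, temperature=0.07):
    """Symmetric InfoNCE loss (illustrative, not the paper's SG_MCL).

    Row i of obj_feats and row i of txt_feats are assumed to be a
    matched (positive) pair; all other rows act as negatives.
    """
    # L2-normalise so that the dot product is a cosine similarity.
    obj = obj_feats / np.linalg.norm(obj_feats, axis=1, keepdims=True)
    txt = txt_feats / np.linalg.norm(txt_feats, axis=1, keepdims=True)
    logits = obj @ txt.T / temperature  # shape (N, N)

    def cross_entropy_on_diagonal(scores):
        # Numerically stable log-softmax over each row.
        scores = scores - scores.max(axis=1, keepdims=True)
        log_prob = scores - np.log(np.exp(scores).sum(axis=1, keepdims=True))
        # The positive for row i sits in column i.
        return -np.mean(np.diag(log_prob))

    # Average the object-to-text and text-to-object directions.
    return 0.5 * (cross_entropy_on_diagonal(logits)
                  + cross_entropy_on_diagonal(logits.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    objects = rng.normal(size=(4, 8))
    texts = objects + 0.01 * rng.normal(size=(4, 8))  # nearly aligned pairs
    print("aligned loss:   ", info_nce(objects, texts))
    print("misaligned loss:", info_nce(objects, np.roll(texts, 1, axis=0)))
```

With nearly aligned pairs the loss should be close to zero, and it should grow once the text rows are shifted so that positives no longer match.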

Paper Structure

This paper contains 19 sections, 9 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: An illustration of the natural alignment between 3D scene graphs and natural language descriptions. We leverage this correspondence to pre-train our 3D-language model in the form of contrastive learning.
  • Figure 2: The overview of our model. Given a 3D point cloud-text pair, we first use a scene encoding module to extract 3D object proposals and a text encoder to generate textual features. We then treat the 3D object proposals as nodes to construct the scene graph, using a scene graph learning module to update node and edge features. Finally, the model is pre-trained with the proposed scene graph-guided multi-level contrastive learning and masked modality modeling. The pre-trained model can be fine-tuned for various downstream tasks, including 3D visual grounding, 3D dense captioning and 3D question answering.
  • Figure 3: Scene graph-guided multi-level contrastive learning (SG_MCL) strategy. It aligns 3D object and textual features at various levels, i.e., word-object level, sentence-referred object level and scene-level.
  • Figure 4: Qualitative results on downstream tasks: (a) 3D visual grounding, (b) 3D dense captioning and (c) 3D question answering. The green box indicates the ground truth, the blue box represents predictions from our model trained from scratch, and the red box shows predictions from our pre-training model.
  • Figure 5: Qualitative results on downstream 3D visual grounding. The green box represents the ground truth, and the red box indicates the prediction.
  • ...and 3 more figures
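
The caption of Figure 2 describes treating 3D object proposals as graph nodes and then updating their features with a scene graph learning module. The sketch below shows one simple way such a graph could be built and used, under stated assumptions: edges connect each proposal to its k nearest neighbours by box centre, and a single parameter-free averaging step stands in for a learned graph convolution. Neither the k-nearest-neighbour rule nor the helper names `knn_edges` and `mean_aggregate` come from the source, and edge features are omitted.

```python
import numpy as np


def knn_edges(centers, k):
    """Directed edges (i, j) from each node i to its k nearest neighbours.

    `centers` is an (N, 3) array of hypothetical object-proposal centres.
    """
    diffs = centers[:, None, :] - centers[None, :, :]
    dists = np.linalg.norm(diffs, axis=-1)
    np.fill_diagonal(dists, np.inf)  # a node is never its own neighbour
    neighbours = np.argsort(dists, axis=1)[:, :k]
    return [(i, int(j)) for i in range(len(centers)) for j in neighbours[i]]


def mean_aggregate(node_feats, edges):
    """One message-passing step: average each node with its neighbours.

    This is a stand-in for a learned graph convolution, with no weights.
    """
    feats = np.asarray(node_feats, dtype=float)
    summed = feats.copy()
    counts = np.ones(len(feats))
    for i, j in edges:
        summed[i] += feats[j]
        counts[i] += 1
    return summed / counts[:, None]


if __name__ == "__main__":
    centers = np.array([[0.0, 0.0, 0.0], [1.0, 0.0, 0.0], [5.0, 0.0, 0.0]])
    feats = np.array([[0.0], [2.0], [4.0]])
    edges = knn_edges(centers, k=1)
    print(edges)
    print(mean_aggregate(feats, edges))
```

A learned module would replace the plain average with trainable transformations and would also carry edge features; the sketch only shows the data flow from proposals to a graph to updated node features.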