Weakly-Supervised 3D Scene Graph Generation via Visual-Linguistic Assisted Pseudo-labeling

Xu Wang, Yifan Li, Qiudan Zhang, Wenhui Wu, Mark Junjie Li, Jianmin Jiang

TL;DR

This work addresses the high annotation cost of 3D scene graph generation by proposing 3D-VLAP, a weakly supervised approach that uses a cross-modal visual-linguistic model to derive pseudo-labels for objects and relations. It aligns 3D point clouds with 2D images via camera parameters and uses CLIP-based cross-modal matching to link objects to textual labels; a Hybrid Matching Strategy and a Mask Filter then refine the pseudo-labels. An edge self-attention based GNN (ESA-GNN) reasons over the generated graph and is trained with a combined loss that includes a contrastive term aligning visual and textual features. Experiments on the 3DSSG dataset show that 3D-VLAP performs comparably to fully supervised methods while significantly reducing annotation effort. The framework can also be extended to other 3D SGG models, supporting scalable, data-efficient scene understanding in real-world settings.

Abstract

Learning to build 3D scene graphs is essential for real-world perception in a structured and rich fashion. However, previous 3D scene graph generation methods utilize a fully supervised learning manner and require a large amount of entity-level annotation data of objects and relations, which is extremely resource-consuming and tedious to obtain. To tackle this problem, we propose 3D-VLAP, a weakly-supervised 3D scene graph generation method via Visual-Linguistic Assisted Pseudo-labeling. Specifically, our 3D-VLAP exploits the superior ability of current large-scale visual-linguistic models to align the semantics between texts and 2D images, as well as the naturally existing correspondences between 2D images and 3D point clouds, and thus implicitly constructs correspondences between texts and 3D point clouds. First, we establish the positional correspondence from 3D point clouds to 2D images via camera intrinsic and extrinsic parameters, thereby achieving alignment of 3D point clouds and 2D images. Subsequently, a large-scale cross-modal visual-linguistic model is employed to indirectly align 3D instances with the textual category labels of objects by matching 2D images with object category labels. The pseudo labels for objects and relations are then produced for 3D-VLAP model training by calculating the similarity between visual embeddings and textual category embeddings of objects and relations encoded by the visual-linguistic model, respectively. Ultimately, we design an edge self-attention based graph neural network to generate scene graphs of 3D point cloud scenes. Extensive experiments demonstrate that our 3D-VLAP achieves comparable results with current advanced fully supervised methods, meanwhile significantly alleviating the pressure of data annotation.
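
The first two steps in the abstract, projecting 3D points into an image with camera parameters and then picking the category whose text embedding is most similar to a visual embedding, can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the world-to-camera convention for the extrinsics, and the toy matrices and embeddings are all assumptions chosen for illustration, and the Hybrid Matching Strategy is not modeled here.

```python
import math


def project_point(p_world, K, R, t):
    """Pinhole projection of one 3D world point to pixel coordinates (u, v).

    Assumes (R, t) maps world to camera coordinates: X_c = R @ X_w + t.
    Returns None when the point lies behind the camera.
    """
    xc = [sum(R[i][j] * p_world[j] for j in range(3)) + t[i] for i in range(3)]
    if xc[2] <= 0:
        return None
    # Apply the 3x3 intrinsic matrix, then divide by the homogeneous coordinate.
    h = [sum(K[i][j] * xc[j] for j in range(3)) for i in range(3)]
    return (h[0] / h[2], h[1] / h[2])


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def assign_pseudo_label(visual_emb, text_embs):
    """Return the label whose text embedding is most similar to visual_emb."""
    return max(text_embs, key=lambda label: cosine(visual_emb, text_embs[label]))


# Toy camera (hypothetical values): identity pose, focal length 500 px.
K = [[500.0, 0.0, 320.0], [0.0, 500.0, 240.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]
center_pixel = project_point([0.0, 0.0, 2.0], K, R, t)

# Toy embeddings standing in for the visual-linguistic model's outputs.
text_embs = {"chair": [1.0, 0.1, 0.0], "table": [0.0, 1.0, 0.2]}
label = assign_pseudo_label([0.9, 0.2, 0.0], text_embs)
```

In a real pipeline the projected pixels of a 3D instance would select an image region, and the embeddings would come from the visual-linguistic model's image and text encoders rather than from hand-written vectors.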

Paper Structure (22 sections, 12 equations, 7 figures, 7 tables)

Figures (7)

  • Figure 1: The comparison of fully supervised models and our 3D-VLAP. Our designed 3D-VLAP utilizes a visual-linguistic model to assist in generating pseudo-labels for nodes and edges, thereby supervising the generation of 3D scene graphs. In contrast, fully supervised models directly predict 3D scene graphs supervised by instance-level labels and may incorporate 2D images during training.
  • Figure 2: Comparison of fully supervised and our constructed weakly-supervised signals (triplet-set). In fully supervised annotation, each of the $K$ objects and $K\times(K-1)$ relations in the scene must be labeled individually. In contrast, our weakly-supervised annotation simplifies this process by identifying the triplets present in the scene and their count, thus significantly reducing the annotation cost.
  • Figure 3: A holistic architecture of semantic-enhanced weakly-supervised 3D scene graph generation via visual-linguistic driven pseudo-labeling. We first utilize a Hybrid Matching Strategy to obtain pseudo-labels for each instance during the training procedure. Then, when generating relational pseudo-labels, these object pseudo-labels are used to filter object pairs through a Mask Filter, which improves the accuracy of the relational pseudo-labels. Finally, the obtained pseudo-labels for objects and relations are used as supervisory signals during 3D scene graph model training. In the inference stage, we only need to feed 3D point cloud data into the proposed method to directly generate a 3D scene graph.
  • Figure 4: Mask Filter. The Mask Filter utilizes the pseudo-labels of nodes to filter the candidate edges in the scene graph, thereby optimizing the generation of relation pseudo-labels. (a) represents the scene graph with object pseudo-labels and the supervisory signal triplet set. In (b), the black edges represent the set of candidate edges, while the green edges represent the generated relation pseudo-labels.
  • Figure 5: Qualitative comparison between our proposed method and two excellent fully supervised methods on the 3DSSG dataset (Wald et al., 2020), where red indicates misclassified nodes and edges.
  • ...and 2 more figures
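
The Mask Filter idea from the Figure 4 caption, using node pseudo-labels together with the scene's triplet set to prune candidate edges, can be sketched as follows. This is one plausible reading of the caption, not the paper's implementation: the function name `mask_filter`, the data layout, and the example labels are hypothetical. The sketch only narrows the admissible predicates for each edge and does not choose among them.

```python
def mask_filter(node_labels, candidate_edges, triplet_set):
    """Keep candidate edges whose endpoint pseudo-labels match a weak triplet.

    node_labels: dict mapping node id -> object pseudo-label
    candidate_edges: iterable of (subject_id, object_id) pairs
    triplet_set: set of (subject_label, predicate, object_label) tuples
    Returns a dict mapping each kept edge to its sorted admissible predicates.
    """
    # Index the triplet set by (subject_label, object_label).
    allowed = {}
    for s_label, predicate, o_label in triplet_set:
        allowed.setdefault((s_label, o_label), set()).add(predicate)

    kept = {}
    for s, o in candidate_edges:
        predicates = allowed.get((node_labels[s], node_labels[o]))
        if predicates:
            kept[(s, o)] = sorted(predicates)
    return kept


# Toy scene with K = 3 objects, hence K * (K - 1) = 6 directed candidate edges.
node_labels = {0: "chair", 1: "floor", 2: "table"}
candidate_edges = [(s, o) for s in node_labels for o in node_labels if s != o]
triplets = {("chair", "standing on", "floor"), ("chair", "close by", "table")}
kept = mask_filter(node_labels, candidate_edges, triplets)
```

Here only the edges chair→floor and chair→table survive; the other four candidates have no matching triplet and receive no relation pseudo-label in this sketch.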