3D Weakly Supervised Semantic Segmentation with 2D Vision-Language Guidance

Xiaoxu Xu, Yitian Yuan, Jinlong Li, Qiudan Zhang, Zequn Jie, Lin Ma, Hao Tang, Nicu Sebe, Xu Wang

TL;DR

3DSS-VLG tackles the challenge of 3D semantic segmentation with weak supervision by leveraging 2D vision-language guidance to align 3D point embeddings with both image and text spaces, using only scene-level labels. It introduces a three-stage training framework built on a frozen 2D vision-language backbone and a 3D MinkowskiNet backbone: Pseudo Label Generation Stage, Embeddings Specialization Stage, and Embeddings Soft-Guidance Stage, enabling implicit cross-modal alignment through pseudo labels and a dedicated adapter. The method achieves state-of-the-art results on the S3DIS and ScanNet datasets under scene-level supervision and shows robust generalization to unseen domains, often surpassing methods that require more supervision. This approach highlights the practical value of textual semantic information and 2D-3D correspondences for reducing annotation costs while maintaining high segmentation quality in indoor scenes.

Abstract

In this paper, we propose 3DSS-VLG, a weakly supervised approach for 3D Semantic Segmentation with 2D Vision-Language Guidance. In this alternative approach, a 3D model predicts a dense embedding for each point that is co-embedded with the aligned image and text spaces of a 2D vision-language model. Specifically, our method exploits the superior generalization ability of 2D vision-language models and proposes the Embeddings Soft-Guidance Stage, which uses that ability to implicitly align 3D embeddings and text embeddings. Moreover, we introduce the Embeddings Specialization Stage to purify the feature representation with the help of the given scene-level label, yielding a better feature supervised by the corresponding text embedding. Thus, the 3D model gains informative supervision from both the image embedding and the text embedding, leading to competitive segmentation performance. To the best of our knowledge, this is the first work to investigate 3D weakly supervised semantic segmentation using the textual semantic information of text category labels. Moreover, through extensive quantitative and qualitative experiments, we show that 3DSS-VLG not only achieves state-of-the-art performance on both the S3DIS and ScanNet datasets, but also maintains strong generalization capability.

Paper Structure (18 sections, 2 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: Comparison of different approaches. (a) The conventional 3D WSSS approach adopts the coarse-grained CAM method in a global manner and is supervised by scene-level annotations or subcloud-level annotations. (b) Our proposed 3DSS-VLG approach leverages natural 3D-2D correspondence from geometric camera calibration and 2D-text correspondence from vision-language models, to implicitly align texts and 3D point clouds.
  • Figure 2: The proposed pseudo label generation procedure. We first use the text encoder $\varepsilon^{text}$ of OpenSeg to get embeddings of the full category labels $\mathbf{F}^C$, and the 2D image encoder $\varepsilon^{2D}$ of OpenSeg to get embeddings of the 2D images $\mathbf{F}^{2D}$. Note that the whole OpenSeg model is frozen during pseudo label generation. We then back-project the 2D embeddings $\mathbf{F}^{2D}$ to obtain the 2D-projected embeddings $\mathbf{P}^{2D}$. Specifically, for each point $(x^{3D}, y^{3D}, z^{3D})$ in the point cloud, we use the geometric camera calibration matrices $GCCM^{img}$ to calculate the corresponding positions $(x^{2D}, y^{2D})$ on the multi-view images $S$. We then gather the corresponding 2D embeddings in $\mathbf{F}^{2D}$ and average them to get the 2D-projected embeddings $\mathbf{P}^{2D}$. We perform matrix multiplication on $\mathbf{F}^{C}$ and $\mathbf{P}^{2D}$ to get the 3D point cloud semantic segmentation prediction logits $\mathbf{L}^{2D}$. Finally, we use the scene-level labels as a mask $M$ to filter out confusing and unreliable predictions, which gives the more accurate predicted logits $\mathbf{L}^{f}$ and the pseudo labels $\mathbf{Y}$.
  • Figure 3: The training procedure of our proposed 3DSS-VLG. It is divided into two main stages: (a) the Embeddings Specialization Stage and (b) the Embeddings Soft-Guidance Stage. For (a), we first use the text encoder $\varepsilon^{text}$ of OpenSeg to obtain embeddings of the category labels $\mathbf{F}^C$, which are frozen during the training procedure of (a). Meanwhile, we get the initial 2D-projected embeddings $\mathbf{P}^{2D}$ from the 2D module and use the adapter module to transfer $\mathbf{P}^{2D}$ to a new embedding space, obtaining the adapted 3D embeddings $\mathbf{A}^{3D}$. We perform matrix multiplication on $\mathbf{A}^{3D}$ and $\mathbf{F}^C$ to get the predicted probability $\mathbf{L}^{a}$. Finally, we use the pseudo labels $\mathbf{Y}$ to supervise the model; the green dashed lines denote back-propagation of the loss $\mathcal{L}_a$. For (b), we first use the adapter module to obtain the adapted 3D embeddings $\mathbf{A}^{3D}$. Note that the adapter module is frozen during the training procedure of (b). Meanwhile, we use the 3D module $\varepsilon^{3D}$ to obtain the 3D embeddings $\mathbf{F}^{3D}$. The cosine similarity loss $\mathcal{L}_s$ is used to train the model; the red dashed lines denote back-propagation of the loss $\mathcal{L}_s$.
  • Figure 4: Qualitative results on the S3DIS dataset of baseline and our 3DSS-VLG. From left to right: input point clouds, ground truth, baseline results, and our 3DSS-VLG results.
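
The pseudo label generation steps described in the Figure 2 caption can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' implementation: the function names `average_multiview` and `generate_pseudo_labels`, the `visible` flag array, and the choice to apply the scene-level mask $M$ by setting absent classes to negative infinity are all assumptions made for the example. It also assumes the per-view embeddings have already been sampled at each point's projected pixel position.

```python
import numpy as np


def average_multiview(view_embeddings, visible):
    """Average per-view 2D embeddings into 2D-projected embeddings P^{2D}.

    view_embeddings: (V, N, D) embedding sampled for each of N points in V views.
    visible:         (V, N) bool, True where the point projects into the view.
    Both arguments are hypothetical inputs for this sketch.
    """
    weights = visible[..., None].astype(view_embeddings.dtype)
    # Avoid division by zero for points seen in no view.
    counts = np.maximum(weights.sum(axis=0), 1.0)
    return (view_embeddings * weights).sum(axis=0) / counts


def generate_pseudo_labels(p2d, f_c, scene_mask):
    """Compute filtered logits L^{f} and pseudo labels Y.

    p2d:        (N, D) 2D-projected embeddings P^{2D}.
    f_c:        (K, D) text embeddings F^{C} of the K category labels.
    scene_mask: (K,) bool scene-level label mask M.
    """
    logits = p2d @ f_c.T  # L^{2D}, shape (N, K)
    # Assumption: classes absent from the scene are excluded via -inf.
    filtered = np.where(scene_mask[None, :], logits, -np.inf)
    pseudo = filtered.argmax(axis=1)
    return filtered, pseudo


if __name__ == "__main__":
    p2d = np.array([[1.0, 0.0], [0.5, 0.9]])
    f_c = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]])
    mask = np.array([True, True, False])
    _, labels = generate_pseudo_labels(p2d, f_c, mask)
    print(labels)
```

In this toy input the second point scores highest on the third class before masking; because the scene-level mask marks that class as absent, the pseudo label falls back to the best class that is actually present in the scene.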
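
The two training objectives named in the Figure 3 caption, $\mathcal{L}_a$ and $\mathcal{L}_s$, can be illustrated with a forward-only NumPy sketch. The exact loss definitions are not given in the captions, so this example assumes a standard softmax cross-entropy against the pseudo labels for $\mathcal{L}_a$ and a mean of one minus cosine similarity for $\mathcal{L}_s$; the function names are hypothetical, and gradient computation and the freezing of modules are omitted.

```python
import numpy as np


def specialization_loss(a3d, f_c, pseudo_labels):
    """Assumed form of L_a: cross-entropy between A^{3D} F^{C}^T and Y.

    a3d:           (N, D) adapted 3D embeddings A^{3D} from the adapter.
    f_c:           (K, D) frozen text embeddings F^{C}.
    pseudo_labels: (N,) integer pseudo labels Y.
    """
    logits = a3d @ f_c.T  # L^{a}, shape (N, K)
    # Numerically stable log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    picked = log_probs[np.arange(len(pseudo_labels)), pseudo_labels]
    return float(-picked.mean())


def soft_guidance_loss(f3d, a3d, eps=1e-8):
    """Assumed form of L_s: mean (1 - cosine similarity) between F^{3D} and A^{3D}.

    f3d: (N, D) embeddings from the 3D module.
    a3d: (N, D) adapted embeddings, treated as fixed targets in stage (b).
    """
    f = f3d / (np.linalg.norm(f3d, axis=1, keepdims=True) + eps)
    a = a3d / (np.linalg.norm(a3d, axis=1, keepdims=True) + eps)
    return float((1.0 - (f * a).sum(axis=1)).mean())


if __name__ == "__main__":
    a3d = np.array([[1.0, 0.0], [0.0, 1.0]])
    f_c = np.array([[1.0, 0.0], [0.0, 1.0]])
    print(specialization_loss(a3d, f_c, np.array([0, 1])))
    print(soft_guidance_loss(a3d, a3d))
```

Under these assumptions, stage (a) would update only the adapter through `specialization_loss`, while stage (b) would update only the 3D module through `soft_guidance_loss`, pulling its per-point embeddings toward the adapted ones.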