Visual Foundation Models Boost Cross-Modal Unsupervised Domain Adaptation for 3D Semantic Segmentation

Jingyi Xu, Weidong Yang, Lingdong Kong, Youquan Liu, Rui Zhang, Qingyuan Zhou, Ben Fei

TL;DR

This work proposes VFMSeg, a pipeline that leverages 2D visual foundation models (VFMs) to enhance cross-modal unsupervised domain adaptation for 3D semantic segmentation. It studies how to harness the knowledge priors learned by VFMs to produce more accurate labels for unlabeled target domains and improve overall performance.

Abstract

Unsupervised domain adaptation (UDA) is vital for alleviating the workload of labeling 3D point cloud data and mitigating the absence of labels when facing a newly defined domain. Various methods that use images to enhance cross-domain 3D segmentation have recently emerged. However, the pseudo labels, which are generated by models trained on the source domain and provide additional supervision signals for the unseen domain, are inadequate for 3D segmentation because of their inherent noisiness, and they consequently restrict the accuracy of neural networks. With the advent of 2D visual foundation models (VFMs) and their abundant knowledge priors, we propose a novel pipeline, VFMSeg, that leverages these models to further enhance the cross-modal unsupervised domain adaptation framework. In this work, we study how to harness the knowledge priors learned by VFMs to produce more accurate labels for unlabeled target domains and improve overall performance. We first utilize a multi-modal VFM, pre-trained on large-scale image-text pairs, to provide supervised labels (VFM-PL) for images and point clouds from the target domain. Then, another VFM trained on fine-grained 2D masks guides the generation of semantically augmented images and point clouds, which mix data from the source and target domains in the manner of view frustums (FrustumMixing), to enhance the performance of neural networks. Finally, we merge class-wise predictions across modalities to produce more accurate annotations for unlabeled target domains. Our method is evaluated on various autonomous driving datasets, and the results demonstrate a significant improvement on the 3D segmentation task.
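
The VFM-PL step can be illustrated with a minimal NumPy sketch. It assumes that both the source-trained 2D network and the VFM already provide per-pixel class probabilities over the same class set, and that each 3D point comes with the pixel it projects to. The function names, the confidence threshold, and the ignore index are hypothetical choices made for illustration and are not taken from the paper.

```python
import numpy as np


def fuse_pseudo_labels(probs_2d, probs_vfm, conf_threshold=0.5, ignore_index=-100):
    """Average two per-pixel probability maps and keep only confident labels.

    probs_2d, probs_vfm: float arrays of shape (H, W, C) over the same C classes.
    conf_threshold and ignore_index are illustrative assumptions, not paper values.
    Returns an (H, W) integer label map.
    """
    if probs_2d.shape != probs_vfm.shape:
        raise ValueError("probability maps must have the same shape")
    fused = 0.5 * (probs_2d + probs_vfm)       # equal-weight average of both sources
    labels = fused.argmax(axis=-1)             # most likely class per pixel
    confidence = fused.max(axis=-1)            # fused probability of that class
    labels[confidence < conf_threshold] = ignore_index  # drop uncertain pixels
    return labels


def lift_labels_to_points(label_map, pixel_rc):
    """Give each 3D point the label of the pixel it projects to.

    pixel_rc: (N, 2) integer array of (row, col) image coordinates, assumed to be
    precomputed from the camera calibration.
    """
    return label_map[pixel_rc[:, 0], pixel_rc[:, 1]]


# Toy example: a 1x2 image with three classes and three projected points.
probs_2d = np.array([[[0.6, 0.3, 0.1], [0.4, 0.3, 0.3]]])
probs_vfm = np.array([[[0.1, 0.8, 0.1], [0.3, 0.3, 0.4]]])
pixel_labels = fuse_pseudo_labels(probs_2d, probs_vfm)
point_labels = lift_labels_to_points(pixel_labels, np.array([[0, 0], [0, 1], [0, 0]]))
```

In the toy example the two sources disagree on the first pixel and the average settles on the class the VFM favors, while the second pixel is left unlabeled because neither source is confident.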

Paper Structure (35 sections, 3 equations, 11 figures, 10 tables, 3 algorithms)

Figures (11)

  • Figure 1: (a) Comparison between traditional pseudo labels (left) and pseudo labels from our VFM-PL (right). (b) Illustration of our FrustumMixing, which narrows the domain gap by mixing source and target samples with the help of VFMs. Comparison of (c) traditional cross-domain UDA methods and (d) VFMSeg, which leverages the powerful priors of VFMs to boost UDA performance.
  • Figure 2: Framework overview. Both the 2D and 3D neural networks are trained on source and target data, so domain-invariant features are captured during parameter optimization. The networks have two projection heads: the first leverages the supervision signal from labels, and the second provides cross-modal information exchange through KL divergence (Sec. \ref{subsec:framework}). Since the target domain has no labels under the UDA setting, pre-trained 2D and 3D networks are first used to generate pseudo-labels for the target domain. A VFM provides guidance for producing more accurate pseudo-labels (Sec. \ref{subsec:leverageVFM}). The visual prior of a VFM is also leveraged to create diverse training samples that bridge the gap between the two domains (Sec. \ref{subsec:frustrumMix}).
  • Figure 3: VFM-PL: Leveraging the visual prior for generating pseudo labels. We utilize a VFM to provide guidance for generating pseudo-labels in the target domain. Since SEEM [zou2023seem] is trained on a huge amount of image-text pairs and segmentation masks across diverse scenes, its learned feature encoder is naturally resistant to domain shifts. By averaging the probabilistic predictions of the pre-trained 2D network and SEEM, pseudo-label generation becomes more precise and robust.
  • Figure 4: FrustumMixing: VFM-guided semantic mixing. To further enhance the capability of neural networks to bridge the gap across domains, we propose to utilize SAM [Alexander2023sam] to generate fine-grained 2D masks from images of both domains. Images are mixed by using the masks generated from one image to cut out the corresponding areas, then filling these masked areas with the respective pixels from the other image.
  • Figure 5: Qualitative results. We show the ensembling results of four scenarios, obtained by averaging the softmax outputs of the 2D and 3D networks. Our method can improve the performance of 3D semantic segmentation. Note that, thanks to the VFMs, our method segments detailed objects very well. From top to bottom, the focused areas are the trunk of a tree, manmade objects under restricted lighting conditions, the silhouette of a vehicle, and, most importantly, a kid playing close to the road.
  • ...and 6 more figures
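
The mask-based mixing described in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions: both images have the same resolution, the boolean mask stands in for a SAM output, and each point cloud comes with precomputed (row, col) projections into its image. The function names and the rule for mixing points are hypothetical and are not the authors' implementation.

```python
import numpy as np


def mix_images(img_a, img_b, mask_a):
    """Cut the masked region out of img_a and fill it with img_b's pixels.

    img_a, img_b: arrays of shape (H, W, 3) with the same resolution (assumed).
    mask_a: boolean (H, W) array, a stand-in for a mask generated from img_a.
    """
    return np.where(mask_a[..., None], img_b, img_a)


def mix_points(points_a, rc_a, points_b, rc_b, mask_a):
    """Mix two point clouds consistently with the image mix (illustrative rule).

    A point from sample A is kept if it projects outside the mask; a point from
    sample B is kept if it projects inside, i.e. it lies in the frustum behind
    the masked pixels. rc_a, rc_b: (N, 2) integer (row, col) projections.
    """
    keep_a = ~mask_a[rc_a[:, 0], rc_a[:, 1]]
    keep_b = mask_a[rc_b[:, 0], rc_b[:, 1]]
    return np.concatenate([points_a[keep_a], points_b[keep_b]], axis=0)


# Toy example: 2x2 images, with only the top-left pixel masked.
img_a = np.zeros((2, 2, 3))
img_b = np.ones((2, 2, 3))
mask_a = np.array([[True, False], [False, False]])
mixed_img = mix_images(img_a, img_b, mask_a)

points_a = np.array([[1.0, 0.0, 0.0], [2.0, 0.0, 0.0]])
points_b = np.array([[3.0, 0.0, 0.0], [4.0, 0.0, 0.0]])
rc_a = np.array([[0, 0], [1, 1]])   # first A point falls inside the mask
rc_b = np.array([[0, 0], [0, 1]])   # first B point falls inside the mask
mixed_pts = mix_points(points_a, rc_a, points_b, rc_b, mask_a)
```

Selecting points by their image projection keeps the mixed image and the mixed point cloud consistent, which matters when 2D and 3D networks are trained on paired samples.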