Learning to Adapt SAM for Segmenting Cross-domain Point Clouds

Xidong Peng, Runnan Chen, Feng Qiao, Lingdong Kong, Youquan Liu, Yujing Sun, Tai Wang, Xinge Zhu, Yuexin Ma

TL;DR

This paper tackles unsupervised domain adaptation for 3D LiDAR segmentation by aligning both source and target point features to the general feature space of the Vision Foundation Model SAM, using RGB images as an offline bridge to unify 2D-3D representations. It introduces a SAM-guided 3D feature alignment loss $L_{align}$ and a novel Scene-Instance Hybrid Feature Augmentation to generate diverse cross-domain point clouds, enhancing alignment with SAM features. The method, evaluated on multiple cross-domain benchmarks, achieves state-of-the-art performance with large gains over strong baselines, and ablations confirm the critical roles of the SAM-guided alignment, augmentation strategies, and integration of alternative VFMs. The approach demonstrates robust cross-domain generalization, reduces reliance on target-domain labels, and suggests broader applicability to challenging tasks such as panoptic segmentation and domain generalization, with potential extensions to 3D detection.

Abstract

Unsupervised domain adaptation (UDA) in 3D segmentation tasks presents a formidable challenge, primarily stemming from the sparse and unordered nature of point cloud data. Especially for LiDAR point clouds, the domain discrepancy becomes obvious across varying capture scenes, fluctuating weather conditions, and the diverse array of LiDAR devices in use. While previous UDA methodologies have often sought to mitigate this gap by aligning features between source and target domains, this approach falls short when applied to 3D segmentation due to the substantial domain variations. Inspired by the remarkable generalization capabilities exhibited by the vision foundation model SAM in the realm of image segmentation, our approach leverages the wealth of general knowledge embedded within SAM to unify feature representations across diverse 3D domains and further solves the 3D domain adaptation problem. Specifically, we harness the corresponding images associated with point clouds to facilitate knowledge transfer and propose an innovative hybrid feature augmentation methodology, which significantly enhances the alignment between the 3D feature space and SAM's feature space, operating at both the scene and instance levels. Our method is evaluated on many widely recognized datasets and achieves state-of-the-art performance.

Paper Structure (17 sections, 3 equations, 4 figures, 5 tables)

Figures (4)

  • Figure 1: (a) Comparison of 3D UDA paradigms. Different from aligning two point feature domains directly, our method makes both the source domain and target domain align with the SAM feature space. (b) Visualization of the feature distance across different datasets, where smaller values indicate a more similar distribution. It is obvious that after mapping to SAM feature space, point feature distributions from disparate domains become much more aligned.
  • Figure 2: Pipeline of our method. The point cloud is fed into the point encoder for point embeddings at the top, and the corresponding images are passed through the SAM encoder for image embeddings at the bottom, from which we obtain SAM-guided point embedding with the 2D-3D projection. Alignment loss $L_{align}$ is calculated based on the SAM-guided features and original features. Furthermore, augmented inputs provide diverse feature patterns boosting the 3D-to-SAM feature alignment.
  • Figure 3: Hybrid feature augmentation by data mixing for better 3D-to-SAM feature alignment. Part (a) illustrates the scene-level approaches, including polar-based, range-based, and laser-based point mix-up, where different colors represent points from distinct domains. Part (b) shows the data flow of mixing the point data with instance-level data from another domain using an instance mask; source data is taken as the example for instance-level point generation, and the same procedure applies in the other direction.
  • Figure 4: Visualization of the domain adaptation from nuScenes to SemanticKITTI.
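
The pipeline in Figure 2 and the polar-based mix-up in Figure 3 can be illustrated with a rough NumPy sketch. It is not the authors' implementation, and several details are assumptions made for illustration: the function names are hypothetical, points are taken to be already in camera coordinates with a pinhole intrinsic matrix `K`, the SAM feature map is assumed to be resized to the image resolution, $L_{align}$ is written as a mean cosine distance (this page does not give the exact form of the loss), and the sector-swap rule in `polar_mix` is one plausible reading of "polar-based" mixing.

```python
import numpy as np

def project_points(points_cam, K, image_hw):
    """Project (N, 3) camera-frame points to integer pixel coordinates.

    Returns u, v and a mask of points in front of the camera and inside the image.
    """
    h, w = image_hw
    z = points_cam[:, 2]
    in_front = z > 1e-6
    safe_z = np.where(in_front, z, 1.0)  # avoid dividing by zero or negative depth
    uvw = points_cam @ K.T
    u = np.floor(uvw[:, 0] / safe_z).astype(int)
    v = np.floor(uvw[:, 1] / safe_z).astype(int)
    valid = in_front & (u >= 0) & (u < w) & (v >= 0) & (v < h)
    return u, v, valid

def sam_guided_features(feat_map, u, v, valid):
    """Gather per-point features from an (H, W, C) image feature map (assumed SAM output)."""
    return feat_map[v[valid], u[valid]]

def alignment_loss(point_feats, sam_feats, eps=1e-8):
    """Assumed form of the alignment loss: mean (1 - cosine similarity) over points."""
    p = point_feats / (np.linalg.norm(point_feats, axis=1, keepdims=True) + eps)
    s = sam_feats / (np.linalg.norm(sam_feats, axis=1, keepdims=True) + eps)
    return float(np.mean(1.0 - np.sum(p * s, axis=1)))

def polar_mix(points_a, points_b, start_deg, end_deg):
    """Replace the azimuth sector [start_deg, end_deg) of cloud A with that sector of cloud B.

    Returns the mixed (M, 3) cloud and a per-point domain id (0 = A, 1 = B).
    """
    def azimuth(p):
        return np.degrees(np.arctan2(p[:, 1], p[:, 0])) % 360.0

    az_a, az_b = azimuth(points_a), azimuth(points_b)
    in_a = (az_a >= start_deg) & (az_a < end_deg)
    in_b = (az_b >= start_deg) & (az_b < end_deg)
    kept_a, taken_b = points_a[~in_a], points_b[in_b]
    mixed = np.concatenate([kept_a, taken_b], axis=0)
    domain = np.concatenate([np.zeros(len(kept_a), dtype=int),
                             np.ones(len(taken_b), dtype=int)])
    return mixed, domain
```

In this sketch only points that project inside the image receive a SAM-guided feature, so the loss would be computed on the subset selected by `valid`; the mixed clouds from `polar_mix` would then be fed through the same projection and loss to expose the point encoder to cross-domain feature patterns.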