Multi-View Projection for Unsupervised Domain Adaptation in 3D Semantic Segmentation

Andrew Caunes, Thierry Chateau, Vincent Fremont

TL;DR

This work tackles domain shift in LiDAR-based 3D semantic segmentation by introducing a multi-view projection framework that creates large-scale PC2D datasets from aligned 3D scenes. An ensemble of 2D segmentation models is trained on multiple modalities and views, with occlusion-aware back-projection used to generate dense 3D pseudo-labels for the target domain. The approach achieves state-of-the-art results in Real-to-Real UDA, demonstrates strong performance on large, structured classes in Simulation-to-Real, and enables rare-class segmentation using only 2D annotations for those rare classes. By using 3D annotations to train 2D models in the PC2D domain and combining their predictions with a simple voting scheme, the method provides a practical, scalable pathway for robust 3D segmentation across diverse domains.

Abstract

3D semantic segmentation is essential for autonomous driving and road infrastructure analysis, but state-of-the-art 3D models suffer from severe domain shift when applied across datasets. We propose a multi-view projection framework for unsupervised domain adaptation (UDA). Our method aligns LiDAR scans into coherent 3D scenes and renders them from multiple virtual camera poses to generate large-scale synthetic 2D datasets (PC2D) in various modalities. An ensemble of 2D segmentation models is trained on these modalities, and during inference, hundreds of views per scene are processed, with logits back-projected to 3D using an occlusion-aware voting scheme to produce point-wise labels. These labels can be used directly or to fine-tune a 3D segmentation model in the target domain. We evaluate our approach in both Real-to-Real and Simulation-to-Real UDA, achieving state-of-the-art performance in the Real-to-Real setting. Furthermore, we show that our framework enables segmentation of rare classes, leveraging only 2D annotations for those classes while relying on 3D annotations for others in the source domain.
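The back-projection step described in the abstract can be illustrated with a small, dependency-free sketch. The abstract does not specify how occlusion is handled, so the depth-tolerance test below is an assumption, as are the names `View`, `project`, `accumulate_votes`, `labels_from_votes`, and `depth_tol`, and the simplified camera (pinhole, no rotation, looking along +z). It is meant only to show the general idea: per-view logits are added to the points that are actually visible in that view, and each point takes its most-voted class.

```python
import math
from typing import List, NamedTuple, Optional, Sequence, Tuple

Point = Tuple[float, float, float]


class View(NamedTuple):
    """One rendered view (hypothetical container, not the paper's data structure)."""
    cam_pos: Point                    # camera centre; assumed to look along +z, no rotation
    focal: float                      # focal length in pixels
    width: int
    height: int
    logits: List[List[List[float]]]   # 2D model output, indexed [row][col][class]


def project(p: Point, view: View) -> Optional[Tuple[int, int, float]]:
    """Pinhole projection; returns (col, row, depth), or None if the point is not in the image."""
    x = p[0] - view.cam_pos[0]
    y = p[1] - view.cam_pos[1]
    z = p[2] - view.cam_pos[2]
    if z <= 0:                        # behind the camera
        return None
    u = math.floor(view.focal * x / z + view.width / 2)
    v = math.floor(view.focal * y / z + view.height / 2)
    if 0 <= u < view.width and 0 <= v < view.height:
        return u, v, z
    return None


def accumulate_votes(points: Sequence[Point], views: Sequence[View],
                     num_classes: int, depth_tol: float = 0.05):
    """Sum logits over all views in which a point is visible (assumed occlusion test)."""
    votes = [[0.0] * num_classes for _ in points]
    seen = [0] * len(points)          # number of views that voted for each point
    for view in views:
        hits = [project(p, view) for p in points]
        # Pass 1: z-buffer holding the nearest depth per pixel.
        zbuf = {}
        for hit in hits:
            if hit is not None:
                u, v, z = hit
                zbuf[(u, v)] = min(z, zbuf.get((u, v), math.inf))
        # Pass 2: only points close to the nearest depth receive the pixel's logits.
        for i, hit in enumerate(hits):
            if hit is None:
                continue
            u, v, z = hit
            if z - zbuf[(u, v)] > depth_tol:
                continue              # occluded by a nearer point in this view
            for c in range(num_classes):
                votes[i][c] += view.logits[v][u][c]
            seen[i] += 1
    return votes, seen


def labels_from_votes(votes, seen, ignore: int = -1) -> List[int]:
    """Most-voted class per point; points never seen in any view get `ignore`."""
    return [max(range(len(row)), key=row.__getitem__) if n > 0 else ignore
            for row, n in zip(votes, seen)]
```

A real pipeline would vectorize these loops and use full camera extrinsics; the two-pass structure (build a depth buffer, then vote only for visible points) is the part this sketch is meant to convey.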

Paper Structure

This paper contains 26 sections, 2 figures, 5 tables.

Figures (2)

  • Figure 1: Overview of Seg_3D_by_PC2D. Training Phase. Starting from 3D scenes (LiDAR scans in a common frame) and 3D segmentation masks, virtual camera poses are sampled around each scene. 2D images and their 2D segmentation masks are then rendered from these poses in a chosen modality. See Figure 2 for details on this dataset generation step. Each generated dataset is then used to train an individual 2D semantic segmentation model, and these models together form an ensemble. Inference Phase. A 3D scene is processed by generating virtual camera poses and rendering views in each modality, as in the training phase. Each model of the ensemble then processes the views in its corresponding modality. The resulting 2D logits are back-projected to 3D and accumulated as votes. The final 3D mask is obtained by assigning the most-voted class to each point.
  • Figure 2: PointCloud2D (PC2D) dataset generation pipeline. A 2D semantic segmentation dataset of rendered views of 3D point clouds is generated from a 3D semantic segmentation dataset. To obtain diverse images, virtual camera poses from 4 categories are sampled around the 3D scenes. The rendering modality can be chosen among RGB, Intensity, Depth and Normals. For each scene and each camera pose category, a large number of camera poses are sampled, and the corresponding 2D images and segmentation masks are rendered.
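
The dataset generation step in Figure 2 can be sketched as follows, under stated assumptions. The captions do not define the four camera pose categories or the renderer, so `circle_poses` is a purely hypothetical pose sampler (cameras on a ring, looking at the scene centre), and `render_view` is a minimal one-point-per-pixel z-buffer renderer. It produces a Depth-modality image together with its 2D segmentation mask; RGB, Intensity or Normals would need the corresponding per-point attributes. All function and parameter names are illustrative.

```python
import math
from typing import List, Sequence, Tuple

Vec = Tuple[float, float, float]


def sub(a: Vec, b: Vec) -> Vec:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a: Vec, b: Vec) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a: Vec, b: Vec) -> Vec:
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def normalize(a: Vec) -> Vec:
    n = math.sqrt(dot(a, a))
    return (a[0] / n, a[1] / n, a[2] / n)


def look_at_axes(eye: Vec, target: Vec, up: Vec = (0.0, 0.0, 1.0)):
    """Right, down and forward unit axes of a camera at `eye` looking at `target`.

    Assumes the viewing direction is not parallel to `up`.
    """
    forward = normalize(sub(target, eye))
    right = normalize(cross(forward, up))
    down = cross(forward, right)
    return right, down, forward


def render_view(points: Sequence[Vec], labels: Sequence[int], eye: Vec, target: Vec,
                focal: float, width: int, height: int, background: int = -1):
    """Render a depth image and the matching label mask with a z-buffer."""
    right, down, forward = look_at_axes(eye, target)
    depth = [[math.inf] * width for _ in range(height)]
    mask = [[background] * width for _ in range(height)]
    for p, lab in zip(points, labels):
        rel = sub(p, eye)
        z = dot(rel, forward)
        if z <= 0:                    # behind the camera
            continue
        u = math.floor(focal * dot(rel, right) / z + width / 2)
        v = math.floor(focal * dot(rel, down) / z + height / 2)
        # Keep only the nearest point per pixel so occluded labels do not leak through.
        if 0 <= u < width and 0 <= v < height and z < depth[v][u]:
            depth[v][u] = z
            mask[v][u] = lab
    return depth, mask


def circle_poses(center: Vec, radius: float, height: float, n: int) -> List[Vec]:
    """Hypothetical sampler: n camera positions on a ring around `center`."""
    return [(center[0] + radius * math.cos(2 * math.pi * k / n),
             center[1] + radius * math.sin(2 * math.pi * k / n),
             center[2] + height)
            for k in range(n)]
```

Calling `render_view` once per pose returned by `circle_poses` yields image and mask pairs for one scene; repeating this over scenes and modalities gives the kind of 2D dataset the caption describes.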