PointDC: Unsupervised Semantic Segmentation of 3D Point Clouds via Cross-modal Distillation and Super-Voxel Clustering

Zisheng Chen, Hongbin Xu, Weitao Chen, Zhipeng Zhou, Haihong Xiao, Baigui Sun, Xuansong Xie, Wenxiong Kang

TL;DR

This work addresses the challenge of fully unsupervised semantic segmentation for 3D point clouds, where no annotations are available. It introduces PointDC, a two-stage framework comprising Cross-Modal Distillation (CMD), which transfers semantic cues from multi-view 2D visual features into 3D point features, and Super-Voxel Clustering (SVC), which performs iterative, voxel-level clustering with subsequent point-wise training. The method achieves substantial improvements over prior unsupervised approaches on ScanNet-v2 and S3DIS, with reported gains of +18.4 mIoU and +11.5 mIoU, respectively, and maintains robustness across backbones. By leveraging cross-modal supervision and voxel-based regularization, PointDC demonstrates a practical pathway toward annotation-free 3D scene understanding with potential impact on robotics and autonomous navigation.

Abstract

Semantic segmentation of point clouds usually requires exhaustive human annotation, which has drawn wide attention to the challenging problem of learning from unlabeled data or weaker forms of annotation. In this paper, we make the first attempt at fully unsupervised semantic segmentation of point clouds, which aims to delineate semantically meaningful objects without any form of annotation. Previous unsupervised pipelines designed for 2D images fail on point clouds because of: 1) Clustering Ambiguity, caused by the limited magnitude of data and imbalanced class distribution; 2) Irregularity Ambiguity, caused by the irregular sparsity of point clouds. Therefore, we propose a novel framework, PointDC, which comprises two steps that handle these problems respectively: Cross-Modal Distillation (CMD) and Super-Voxel Clustering (SVC). In the first stage, CMD, multi-view visual features are back-projected into 3D space and aggregated into a unified point feature used to distill the training of the point representation. In the second stage, SVC, the point features are aggregated into super-voxels and then fed to an iterative clustering process to discover semantic classes. PointDC yields a significant improvement over the prior state-of-the-art unsupervised methods on both the ScanNet-v2 (+18.4 mIoU) and S3DIS (+11.5 mIoU) semantic segmentation benchmarks.
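
The data flow described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: it assumes the back-projected per-view features and the super-voxel ids are already given, uses a plain average as the aggregation, substitutes ordinary k-means for the clustering step, and omits all network training. The helper names (`fuse_views`, `pool_supervoxels`, `kmeans`) are hypothetical.

```python
from collections import defaultdict


def mean(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def fuse_views(per_view_feats):
    """CMD-like step (assumed: plain average over the views that see a point)."""
    return [mean(feats) for feats in per_view_feats]


def pool_supervoxels(point_feats, sv_ids):
    """SVC-like step: average the point features inside each super-voxel."""
    groups = defaultdict(list)
    for feat, sv in zip(point_feats, sv_ids):
        groups[sv].append(feat)
    return {sv: mean(feats) for sv, feats in groups.items()}


def sq_dist(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(feats, centroids, iters=10):
    """Plain k-means on {id: feature}; returns ({id: cluster}, centroids)."""
    centroids = [list(c) for c in centroids]  # do not mutate the caller's list
    assign = {}
    for _ in range(iters):
        # Assignment step: nearest centroid for every super-voxel.
        assign = {
            i: min(range(len(centroids)), key=lambda k: sq_dist(f, centroids[k]))
            for i, f in feats.items()
        }
        # Update step: recompute each centroid; keep it if its cluster is empty.
        for k in range(len(centroids)):
            members = [feats[i] for i, c in assign.items() if c == k]
            if members:
                centroids[k] = mean(members)
    return assign, centroids


# Toy input: 4 points, each with the 2D features of the views that see it.
per_view = [
    [[0.0, 0.1], [0.0, -0.1]],  # point 0, visible in two views
    [[0.2, 0.0]],               # point 1, visible in one view
    [[5.0, 5.0], [5.2, 5.0]],   # point 2
    [[4.9, 5.1]],               # point 3
]
sv_ids = [0, 0, 1, 1]  # assumed precomputed super-voxel id per point

point_feats = fuse_views(per_view)
sv_feats = pool_supervoxels(point_feats, sv_ids)
sv_label, _ = kmeans(sv_feats, centroids=[[0.0, 0.0], [1.0, 1.0]])

# Every point inherits the cluster of its super-voxel as a pseudo-label.
pseudo_labels = [sv_label[sv] for sv in sv_ids]
print(pseudo_labels)
```

Clustering the pooled super-voxel features rather than individual points means every point in a super-voxel receives the same pseudo-label, which is one way to read the voxel-based regularization mentioned in the TL;DR.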

Paper Structure (15 sections, 6 equations, 5 figures, 5 tables)

This paper contains 15 sections, 6 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: From unannotated point clouds, we would like a segmentation system to discover the semantic concepts automatically without any supervision.
  • Figure 2: Overview of PointDC framework. The training contains 2 steps: Cross-Modal Distillation and Super-Voxel Clustering.
  • Figure 3: Visualization of the clustering results among multi-view feature maps extracted by DINO. It demonstrates that the multi-view features are semantically correlated.
  • Figure 4: Qualitative comparison of unsupervised segmentation on the ScanNet-v2 validation set. Each of the aligned ground-truth labels and clusters is assigned a color. For better understanding, we show some of the color and name matches at the bottom.
  • Figure 5: Visualization of PointDC's segmentation results under different iterations during training.