UnScene3D: Unsupervised 3D Instance Segmentation for Indoor Scenes

David Rozenberszki, Or Litany, Angela Dai

TL;DR

UnScene3D tackles the problem of fully unsupervised 3D instance segmentation in indoor scenes by combining multi-modal self-supervised cues from RGB-D data with a geometry-aware graph approach. It first generates sparse, high-quality pseudo masks on geometric primitives via a 3D NCut-based process, then densifies these masks through a multi-round self-training loop using a 3D transformer backbone. The key contributions are (i) a multi-modal, geometry-guided pseudo mask generation on mesh primitives, (ii) a robust self-training scheme that progressively expands and refines instance predictions without any manual annotations, and (iii) strong empirical results showing large improvements over prior unsupervised methods on ScanNet, S3DIS, and ARKitScenes, plus data-efficient pretraining benefits. This approach enables scalable, class-agnostic 3D instance segmentation for cluttered indoor environments, with potential for broad impact on autonomous robotics, AR/VR, and 3D scene understanding where labeled data is scarce.
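
To make the "NCut-based" step above more concrete, here is a minimal sketch of a single Normalized Cut bipartition over a graph of geometric primitives. It is an illustration under stated assumptions, not the authors' implementation: the function name `ncut_bipartition`, the cosine-similarity affinity, the threshold `tau`, and the small weight `eps` are all hypothetical choices made for this example.

```python
import numpy as np

def ncut_bipartition(features, tau=0.5, eps=1e-5):
    """Split primitives into two groups with one Normalized Cut step.

    features: (N, C) array, one feature vector per geometric primitive.
    tau, eps: hypothetical affinity threshold and floor weight (assumptions).
    """
    # Cosine similarity between L2-normalised primitive features.
    f = features / np.linalg.norm(features, axis=1, keepdims=True)
    sim = f @ f.T
    # Binarise the affinity; eps keeps the graph connected.
    w = np.where(sim > tau, 1.0, eps)
    d = w.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(d)
    # Symmetric normalised Laplacian: D^{-1/2} (D - W) D^{-1/2}.
    lap = np.eye(len(d)) - d_inv_sqrt[:, None] * w * d_inv_sqrt[None, :]
    # eigh returns eigenvalues in ascending order.
    _, vecs = np.linalg.eigh(lap)
    # Second-smallest eigenvector, mapped back to the generalised problem
    # (D - W) x = lambda * D x.
    fiedler = d_inv_sqrt * vecs[:, 1]
    # Threshold at the mean to obtain a foreground/background split.
    return fiedler > fiedler.mean()
```

Calling `ncut_bipartition(feats)` on an `(N, C)` feature array returns a boolean mask over the `N` primitives. Which side of the cut counts as the object is not decided here and would need a separate rule.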

Abstract

3D instance segmentation is fundamental to geometric understanding of the world around us. Existing methods for instance segmentation of 3D scenes rely on supervision from expensive, manual 3D annotations. We propose UnScene3D, the first fully unsupervised 3D learning approach for class-agnostic 3D instance segmentation of indoor scans. UnScene3D first generates pseudo masks by leveraging self-supervised color and geometry features to find potential object regions. We operate on a basis of geometric oversegmentation, enabling efficient representation and learning on high-resolution 3D data. The coarse proposals are then refined through self-training our model on its predictions. Our approach improves over state-of-the-art unsupervised 3D instance segmentation methods by more than 300% Average Precision score, demonstrating effective instance segmentation even in challenging, cluttered 3D scenes.
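
Operating on a geometric oversegmentation means that per-vertex features are reduced to one descriptor per segment before any grouping happens. A minimal sketch of such a reduction is shown below, assuming simple mean pooling; the function name `pool_features_per_segment` and the choice of averaging are assumptions made for illustration, and the paper may aggregate features differently.

```python
import numpy as np

def pool_features_per_segment(vertex_features, segment_ids):
    """Average per-vertex features over each oversegment.

    vertex_features: (V, C) float array of per-vertex features.
    segment_ids:     (V,) int array mapping each vertex to a segment index.
    """
    num_segments = int(segment_ids.max()) + 1
    sums = np.zeros((num_segments, vertex_features.shape[1]))
    # Unbuffered scatter-add: accumulate every vertex into its segment row.
    np.add.at(sums, segment_ids, vertex_features)
    counts = np.bincount(segment_ids, minlength=num_segments)
    # Guard against unused segment indices to avoid division by zero.
    return sums / np.maximum(counts, 1)[:, None]
```

The result has one row per segment, so a scene with many vertices but far fewer segments yields a much smaller graph for later steps.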

Paper Structure (43 sections, 10 figures, 12 tables, 1 algorithm)

Figures (10)

  • Figure 1: We propose UnScene3D, a fully-unsupervised 3D instance segmentation method, effectively separating semantic instances without requiring any manual annotations. We utilize geometric primitives to ensure crisp masks, and due to our self-training loop, we can also obtain a dense set of predictions, even in cluttered indoor scenarios.
  • Figure 2: UnScene3D first generates a set of pseudo masks (top) to initiate self-training (bottom) for unsupervised 3D instance segmentation. We leverage features from 3D self-supervised pre-training in combination with 2D self-supervised features on an input mesh. These multi-modal features are then aggregated on geometric primitives, integrating low- and high-level signals for pseudo mask segmentation. These initial pseudo masks are then used as supervision for a 3D transformer-based model to produce updated instance masks that are integrated into the supervision of multiple self-training cycles. Finally, we obtain clean and dense instance segmentation without using any manual annotations.
  • Figure 3: Qualitative comparison on ScanNet [dai2017scannet] scenes with projected predictions from the 2D method CutLER [wang2023cut], the traditional clustering-based methods Felzenszwalb [felzenszwalb2004efficient] and HDBSCAN [mcinnes2017accelerated_hdbscan], and the GraphCut-based cluster refinement method [nunes2022unsupervised]. Our approach leverages strong pseudo mask prediction and a self-training strategy to produce cleaner, more accurate instance segmentation.
  • Figure 4: Our unsupervised self-training produces strong 3D features that can serve as a powerful pretraining strategy for 3D instance segmentation in limited data scenarios. UnScene3D significantly outperforms state-of-the-art self-supervised 3D pretraining [hou2021exploring] on ScanNet instance segmentation.
  • Figure 5: Initial pseudo masks generated by UnScene3D in comparison with a 3D-lifted FreeMask [wang2022freesolo]. FreeMask tends to produce a larger set of noisier pseudo masks, while we rely on a cleaner but sparser set for our self-training.
  • ...and 5 more figures
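
The Figure 2 caption describes model predictions being integrated into the supervision over multiple self-training cycles. One plausible way to sketch a single such update is given below. This is a hedged illustration only: the function names `mask_iou` and `update_supervision`, the score threshold, the IoU threshold, and the "keep confident, non-overlapping predictions" rule are assumptions, not the procedure confirmed by the paper.

```python
import numpy as np

def mask_iou(a, b):
    """Intersection over union of two boolean masks over the same primitives."""
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return inter / union if union > 0 else 0.0

def update_supervision(pseudo_masks, pred_masks, pred_scores,
                       score_thresh=0.5, iou_thresh=0.5):
    """Return the target masks for the next self-training round.

    Hypothetical rule: keep all current targets and add every confident
    prediction that does not overlap strongly with a mask already kept.
    """
    scores = np.asarray(pred_scores)
    targets = list(pseudo_masks)
    # Visit predictions from most to least confident.
    for i in np.argsort(scores)[::-1]:
        if scores[i] < score_thresh:
            break  # all remaining predictions are below the threshold
        if all(mask_iou(pred_masks[i], t) < iou_thresh for t in targets):
            targets.append(pred_masks[i])
    return targets
```

Under this rule the target set can only grow from round to round, which mirrors the caption's progression from an initial set of pseudo masks to a dense set of predictions.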