UnScene3D: Unsupervised 3D Instance Segmentation for Indoor Scenes
David Rozenberszki, Or Litany, Angela Dai
TL;DR
UnScene3D addresses fully unsupervised 3D instance segmentation of indoor scenes by combining multi-modal self-supervised cues from RGB-D data with a geometry-aware graph approach. It first generates sparse, high-quality pseudo masks on geometric primitives via a 3D NCut-based process, then densifies these masks over multiple rounds of self-training with a 3D transformer backbone. The key contributions are (i) multi-modal, geometry-guided pseudo mask generation on mesh primitives, (ii) a robust self-training scheme that progressively expands and refines instance predictions without any manual annotations, and (iii) strong empirical results, with large improvements over prior unsupervised methods on ScanNet, S3DIS, and ARKitScenes, as well as benefits for data-efficient pretraining. The approach enables scalable, class-agnostic 3D instance segmentation in cluttered indoor environments, which is relevant to robotics, AR/VR, and 3D scene understanding where labeled data is scarce.
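To make the NCut-based pseudo mask step more concrete, the following is a minimal sketch of a single Normalized Cut bipartition over per-segment features. It is a generic illustration, not the authors' implementation: the function name `ncut_bipartition`, the cosine-similarity affinity, the threshold `tau`, and the floor value `eps` are all assumptions made for this example.

```python
import numpy as np


def ncut_bipartition(seg_feats, tau=0.5, eps=1e-5):
    """Split segments into two groups with one Normalized Cut step.

    seg_feats: (S, C) array, one feature vector per geometric segment.
    tau, eps:  hypothetical affinity threshold and floor value.
    Returns a boolean array of shape (S,) marking one side of the cut.
    """
    feats = np.asarray(seg_feats, dtype=np.float64)
    # Cosine similarity between all pairs of segments.
    norms = np.linalg.norm(feats, axis=1, keepdims=True)
    unit = feats / np.maximum(norms, 1e-12)
    sim = unit @ unit.T

    # Binarized affinity: strong edges get weight 1, the rest a small floor
    # so that the graph stays connected.
    W = np.where(sim > tau, 1.0, eps)
    d = W.sum(axis=1)

    # Symmetric normalized Laplacian: L = I - D^{-1/2} W D^{-1/2}.
    d_inv_sqrt = 1.0 / np.sqrt(d)
    L = np.eye(len(d)) - (d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :])

    # eigh returns eigenvalues in ascending order; the eigenvector of the
    # second smallest one gives the relaxed NCut solution.
    _, vecs = np.linalg.eigh(L)
    fiedler = d_inv_sqrt * vecs[:, 1]

    # Threshold at the mean to obtain a two-way partition.
    return fiedler > fiedler.mean()
```

Under these assumptions, one side of the partition would be kept as a foreground pseudo mask. Which side counts as foreground, and how further masks are extracted from the remaining segments, are choices this sketch leaves open.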
Abstract
3D instance segmentation is fundamental to geometric understanding of the world around us. Existing methods for instance segmentation of 3D scenes rely on supervision from expensive, manual 3D annotations. We propose UnScene3D, the first fully unsupervised 3D learning approach for class-agnostic 3D instance segmentation of indoor scans. UnScene3D first generates pseudo masks by leveraging self-supervised color and geometry features to find potential object regions. We operate on a basis of geometric oversegmentation, enabling efficient representation and learning on high-resolution 3D data. The coarse proposals are then refined by self-training our model on its own predictions. Our approach improves over state-of-the-art unsupervised 3D instance segmentation methods by more than 300% in Average Precision score, demonstrating effective instance segmentation even in challenging, cluttered 3D scenes.
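As an illustration of what operating on a geometric oversegmentation can look like in practice, the sketch below averages per-point features within each segment and maps segment-level decisions back to points. The helper names `pool_to_segments` and `broadcast_to_points` are hypothetical, and the assumption that segment ids are contiguous integers starting at 0 is made only for this example.

```python
import numpy as np


def pool_to_segments(point_feats, seg_ids):
    """Average per-point features within each oversegmentation segment.

    point_feats: (N, C) array of per-point (or per-vertex) features.
    seg_ids:     (N,) integer segment id per point, assumed to be in 0..S-1.
    Returns an (S, C) array of segment features.
    """
    point_feats = np.asarray(point_feats, dtype=np.float64)
    seg_ids = np.asarray(seg_ids)
    num_segs = int(seg_ids.max()) + 1

    # Accumulate feature sums per segment (add.at handles repeated indices).
    sums = np.zeros((num_segs, point_feats.shape[1]), dtype=np.float64)
    np.add.at(sums, seg_ids, point_feats)

    # Divide by the number of points in each segment; guard empty segments.
    counts = np.bincount(seg_ids, minlength=num_segs).astype(np.float64)
    return sums / np.maximum(counts, 1.0)[:, None]


def broadcast_to_points(seg_values, seg_ids):
    """Map one value per segment (e.g. a mask bit) back to every point."""
    return np.asarray(seg_values)[np.asarray(seg_ids)]
```

Working at the segment level means a scene with N points is handled as S segments, with S typically much smaller than N, which is one way to read the efficiency claim above.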
