AutoInst: Automatic Instance-Based Segmentation of LiDAR 3D Scans
Cedric Perauer, Laurenz Adrian Heidrich, Haifan Zhang, Matthias Nießner, Anastasiia Kornilova, Alexey Artemov
TL;DR
This work tackles unsupervised instance segmentation of dense outdoor LiDAR scans, motivated by the high cost of annotated data. It introduces a two-stage framework. First, initial pseudo-instances are generated by building a weighted proxy-graph from multi-modal self-supervised features and applying Normalized Cuts within a neighborhood radius $R=1\,\text{m}$. Second, a 3D instance segmentation network is self-trained to refine these proposals, operating on local chunks that are merged at map level. The method uses multi-modal features (S, P, I) with per-modality edge weights $w_{pq}^{\mu}=e^{-\theta^{\mu}\|x_{p}^{\mu}-x_{q}^{\mu}\|^{2}}$ that are composed across modalities, and optimizes a Dice-BCE loss $L=\lambda_{\text{dice}}L_{\text{dice}}+\lambda_{\text{bce}}L_{\text{bce}}$ during refinement. Evaluated on SemanticKITTI, it achieves strong improvements over unsupervised baselines and competitive results versus supervised methods, showing that label-free 3D instance segmentation is feasible for outdoor scenes. Overall, the approach reduces annotation requirements, demonstrates robustness to dynamic objects, and provides a practical pipeline for 3D scene understanding in outdoor environments.
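To make the refinement objective concrete, the following is a minimal NumPy sketch of a combined loss of the form $L=\lambda_{\text{dice}}L_{\text{dice}}+\lambda_{\text{bce}}L_{\text{bce}}$ for a single predicted mask. It is an illustration under assumptions, not the authors' implementation: the function name `dice_bce_loss`, the soft Dice formulation, the smoothing constant `eps`, and the use of sigmoid over raw logits are all hypothetical choices.

```python
import numpy as np

def dice_bce_loss(logits, targets, lambda_dice=1.0, lambda_bce=1.0, eps=1e-6):
    """Weighted sum of a soft Dice loss and binary cross-entropy for one mask.

    logits  : per-point mask scores before the sigmoid (assumed input form)
    targets : per-point binary pseudo-labels (0 or 1)
    eps     : smoothing term so an empty mask does not divide by zero (assumption)
    """
    logits = np.asarray(logits, dtype=np.float64)
    targets = np.asarray(targets, dtype=np.float64)

    # Numerically stable sigmoid: 1 / (1 + exp(-z)) = exp(-log(1 + exp(-z)))
    probs = np.exp(-np.logaddexp(0.0, -logits))

    # Soft Dice loss: 1 - 2|P ∩ T| / (|P| + |T|)
    intersection = np.sum(probs * targets)
    dice = 1.0 - (2.0 * intersection + eps) / (np.sum(probs) + np.sum(targets) + eps)

    # Binary cross-entropy computed directly from logits
    bce = np.mean(
        targets * np.logaddexp(0.0, -logits)
        + (1.0 - targets) * np.logaddexp(0.0, logits)
    )

    return lambda_dice * dice + lambda_bce * bce


# Example: a confident, correct mask versus a confident, inverted one
pseudo_label = [1, 1, 0, 0]
loss_good = dice_bce_loss([10.0, 10.0, -10.0, -10.0], pseudo_label)
loss_bad = dice_bce_loss([-10.0, -10.0, 10.0, 10.0], pseudo_label)
```

The Dice term measures overlap at the level of the whole mask, while the BCE term penalizes each point separately; the two weights control the balance between them.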
Abstract
Recent progress in acquisition equipment such as LiDAR sensors has enabled sensing of increasingly large outdoor 3D environments. Making sense of such 3D acquisitions requires fine-grained scene understanding, such as constructing instance-based 3D scene segmentations. Commonly, a neural network is trained for this task; however, this requires access to a large, densely annotated dataset, which is widely known to be challenging to obtain. To address this issue, we propose to predict instance segmentations for 3D scenes in an unsupervised way, without relying on ground-truth annotations. To this end, we construct a learning framework consisting of two components: (1) a pseudo-annotation scheme for generating initial unsupervised pseudo-labels; and (2) a self-training algorithm for instance segmentation to fit robust, accurate instances from initial noisy proposals. To generate 3D instance mask proposals, we construct a weighted proxy-graph by connecting 3D points with edges integrating multi-modal image- and point-based self-supervised features, and perform graph-cuts to isolate individual pseudo-instances. We then build on a state-of-the-art point-based architecture and train a 3D instance segmentation model, resulting in significant refinement of initial proposals. To scale to 3D scenes of arbitrary complexity, we design our algorithm to operate on local 3D point chunks and construct a merging step to generate scene-level instance segmentations. Experiments on the challenging SemanticKITTI benchmark demonstrate the potential of our approach, where it attains 13.3% higher Average Precision and 9.1% higher F1 score than the best-performing baseline. The code will be made publicly available at https://github.com/artonson/autoinst.
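The proxy-graph and graph-cut step can be illustrated with a small NumPy sketch. It is a toy version under stated assumptions rather than the authors' pipeline: per-modality Gaussian weights are combined by a product (the text only says they are composed), the graph is dense and built by brute force, and a single spectral bipartition stands in for the full Normalized Cuts procedure. The function names, the modality keys `point_feat` and `image_feat`, and the `thetas` values are hypothetical.

```python
import numpy as np

def proxy_graph_weights(positions, features, thetas, radius=1.0):
    """Dense affinity matrix over N points.

    positions : (N, 3) point coordinates
    features  : dict mapping modality name -> (N, D) feature array
    thetas    : dict mapping modality name -> kernel sharpness
    radius    : only points closer than this are connected
    """
    diff = positions[:, None, :] - positions[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)
    W = (dist <= radius).astype(np.float64)
    for name, x in features.items():
        sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
        # Assumption: modalities are composed by multiplying their kernels
        W *= np.exp(-thetas[name] * sq)
    np.fill_diagonal(W, 0.0)  # no self-loops
    return W

def normalized_cut_bipartition(W):
    """Split a connected graph in two via the spectral relaxation of Normalized Cuts."""
    d = W.sum(axis=1)
    d_inv_sqrt = 1.0 / np.sqrt(np.maximum(d, 1e-12))
    # Symmetric normalized Laplacian: I - D^{-1/2} W D^{-1/2}
    L_sym = np.eye(len(d)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(L_sym)  # eigenvalues in ascending order
    # Second-smallest eigenvector, mapped back to the generalized problem
    fiedler = d_inv_sqrt * eigvecs[:, 1]
    return fiedler > 0.0  # threshold at zero (a simple, common choice)


# Toy scene: six points on a line, two groups with different features
positions = np.array([[0.1 * i, 0.0, 0.0] for i in range(6)])
group = np.array([0.0, 0.0, 0.0, 1.0, 1.0, 1.0])[:, None]
features = {"point_feat": group, "image_feat": group}
thetas = {"point_feat": 3.0, "image_feat": 2.0}

W = proxy_graph_weights(positions, features, thetas, radius=1.0)
labels = normalized_cut_bipartition(W)
```

In this toy scene all points lie within the radius, so the split is driven entirely by feature similarity. Isolating more than two instances would require applying the cut repeatedly, and a real scan would call for a sparse neighborhood graph instead of the dense $N \times N$ matrix used here.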
