AutoInst: Automatic Instance-Based Segmentation of LiDAR 3D Scans

Cedric Perauer, Laurenz Adrian Heidrich, Haifan Zhang, Matthias Nießner, Anastasiia Kornilova, Alexey Artemov

TL;DR

This work tackles unsupervised instance-based segmentation for dense outdoor LiDAR scans, addressing the high cost of annotated data. It introduces a two-stage framework: first, generate initial pseudo-instances by building a weighted proxy-graph from multi-modal self-supervised features and applying Normalized Cuts within a neighborhood radius $R=1\,\text{m}$; second, self-train a refined 3D instance segmentation network to improve the proposals, operating on local chunks with a map-level merge. The method leverages multi-modal features (S, P, I) and edge weights $w_{pq}^{\mu}=e^{-\theta^{\mu}\|x_{p}^{\mu}-x_{q}^{\mu}\|^{2}}$, composing across modalities and optimizing a Dice-BCE loss $L=\lambda_{\text{dice}}L_{\text{dice}}+\lambda_{\text{bce}}L_{\text{bce}}$ during refinement; evaluated on SemanticKITTI, it achieves strong improvements over unsupervised baselines and competitive results versus supervised methods, illustrating effective label-free 3D instance segmentation for outdoor scenes. Overall, the approach reduces annotation requirements, demonstrates robustness to dynamic objects, and provides a practical pipeline for accelerating 3D scene understanding in outdoor environments.
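The combined loss $L=\lambda_{\text{dice}}L_{\text{dice}}+\lambda_{\text{bce}}L_{\text{bce}}$ can be illustrated for a single predicted mask with a minimal NumPy sketch. The function name `dice_bce_loss`, the default weights of 1.0, and the smoothing constant `eps` are illustrative assumptions; the summary does not state the values or the exact Dice formulation used by the authors.

```python
import numpy as np

def dice_bce_loss(probs, targets, lambda_dice=1.0, lambda_bce=1.0, eps=1e-6):
    """Weighted sum of a soft Dice loss and binary cross-entropy for one mask.

    probs   -- predicted per-point mask probabilities in [0, 1]
    targets -- binary (pseudo-)label mask of the same shape
    The weights and eps are hypothetical defaults, not values from the paper.
    """
    # Clip so that the logarithms below stay finite.
    probs = np.clip(np.asarray(probs, dtype=float), eps, 1.0 - eps)
    targets = np.asarray(targets, dtype=float)

    # Soft Dice loss: 1 - 2|P ∩ T| / (|P| + |T|), smoothed by eps.
    intersection = np.sum(probs * targets)
    dice = 1.0 - (2.0 * intersection + eps) / (np.sum(probs) + np.sum(targets) + eps)

    # Binary cross-entropy averaged over points.
    bce = -np.mean(targets * np.log(probs) + (1.0 - targets) * np.log(1.0 - probs))

    return lambda_dice * dice + lambda_bce * bce
```

A mask that matches its pseudo-label yields a loss near zero, while an inverted mask is penalized by both terms. The Dice term is less sensitive to the imbalance between a small instance and a large background, which is one common motivation for pairing it with cross-entropy.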

Abstract

Recently, progress in acquisition equipment such as LiDAR sensors has enabled sensing increasingly spacious outdoor 3D environments. Making sense of such 3D acquisitions requires fine-grained scene understanding, such as constructing instance-based 3D scene segmentations. Commonly, a neural network is trained for this task; however, this requires access to a large, densely annotated dataset, which is widely known to be challenging to obtain. To address this issue, in this work we propose to predict instance segmentations for 3D scenes in an unsupervised way, without relying on ground-truth annotations. To this end, we construct a learning framework consisting of two components: (1) a pseudo-annotation scheme for generating initial unsupervised pseudo-labels; and (2) a self-training algorithm for instance segmentation to fit robust, accurate instances from initial noisy proposals. To enable generating 3D instance mask proposals, we construct a weighted proxy-graph by connecting 3D points with edges integrating multi-modal image- and point-based self-supervised features, and perform graph-cuts to isolate individual pseudo-instances. We then build on a state-of-the-art point-based architecture and train a 3D instance segmentation model, resulting in significant refinement of initial proposals. To scale to arbitrary complexity 3D scenes, we design our algorithm to operate on local 3D point chunks and construct a merging step to generate scene-level instance segmentations. Experiments on the challenging SemanticKITTI benchmark demonstrate the potential of our approach, where it attains 13.3% higher Average Precision and 9.1% higher F1 score compared to the best-performing baseline. The code will be made publicly available at https://github.com/artonson/autoinst.
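The proposal stage described above (a weighted proxy-graph followed by graph cuts) can be sketched on a toy scale as follows. This is a sketch under stated assumptions, not the authors' implementation: the helper names `edge_weights` and `ncut_bipartition` are hypothetical, per-modality weights $e^{-\theta^{\mu}\|x_{p}^{\mu}-x_{q}^{\mu}\|^{2}}$ are assumed to be combined by a product, and the cut is approximated with the standard spectral relaxation of Normalized Cuts on a dense matrix, which is only practical for small point sets.

```python
import numpy as np

def edge_weights(points, features, thetas, radius=1.0):
    """Dense proxy-graph weights between points closer than `radius`.

    points   -- (N, 3) array of 3D coordinates
    features -- dict: modality name -> (N, D) feature array
    thetas   -- dict: modality name -> scale theta for that modality
    Multiplying the per-modality weights is an assumption of this sketch.
    """
    points = np.asarray(points, dtype=float)
    dist2 = np.sum((points[:, None, :] - points[None, :, :]) ** 2, axis=-1)
    weights = (dist2 <= radius ** 2).astype(float)  # connect neighbors only
    for name, feats in features.items():
        feats = np.asarray(feats, dtype=float)
        feat2 = np.sum((feats[:, None, :] - feats[None, :, :]) ** 2, axis=-1)
        weights *= np.exp(-thetas[name] * feat2)
    np.fill_diagonal(weights, 0.0)  # no self-loops
    return weights

def ncut_bipartition(weights):
    """Split a connected graph in two via the spectral Normalized Cut relaxation."""
    degree = weights.sum(axis=1)
    inv_sqrt = 1.0 / np.sqrt(np.maximum(degree, 1e-12))
    laplacian = np.diag(degree) - weights
    normalized = inv_sqrt[:, None] * laplacian * inv_sqrt[None, :]
    _, eigvecs = np.linalg.eigh(normalized)  # eigenvalues in ascending order
    fiedler = inv_sqrt * eigvecs[:, 1]       # second-smallest eigenvector
    return fiedler > 0                       # boolean side of the cut


# Toy example: two groups of three points with distinct features.
points = [[0.0, 0, 0], [0.2, 0, 0], [0, 0.2, 0],
          [0.6, 0, 0], [0.8, 0, 0], [0.6, 0.2, 0]]
features = {"P": [[0, 0]] * 3 + [[1, 1]] * 3,   # stand-in point features
            "I": [[0.0]] * 3 + [[1.0]] * 3}     # stand-in image features
labels = ncut_bipartition(edge_weights(points, features, {"P": 5.0, "I": 1.0}))
```

In this toy case all points lie within the 1 m radius, so the graph is connected and the cut separates the two feature groups. Applying the bipartition recursively, with a stopping criterion on the cut cost, is the usual way to obtain more than two segments; the criterion used by the authors is not given in this summary.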

Paper Structure

This paper contains 27 sections, 4 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: For unsupervised instance segmentation of registered LiDAR 3D scans (a), we integrate multi-modal self-supervised deep features into a weighted proxy-graph, making cuts for generation of instance mask proposals (c) and performing their self-trained refinement (d). Our algorithm is label-free and outperforms unsupervised baselines (b).
  • Figure 2: Overview of our unsupervised 3D instance segmentation framework. We start with a sequence of posed 3D LiDAR scans and RGB images, registering their static segments into a dense 3D map but operating on local overlapping chunks (a). To generate 3D instance mask proposals, we assign multi-modal features to individual 3D points, connecting them into a weighted proxy-graph; we cut the graph to obtain the coarse 3D mask proposals (b). For refinement of 3D instance masks, we start with the coarse proposals and perform multiple rounds of self-training, gradually reintegrating confident instance predictions as ground truth (c). Map-level segmentation is obtained by merging instances predicted in individual chunks (d).
  • Figure 3: Pointwise (a) and pixelwise (b) similarity maps (Eq. \ref{eq:similarity_score}) for the TARL [nunes2023temporal] and DINOv2 [oquab2023dinov2] models, respectively. Following the intuition from prior research [keetha2023anyloc], we select the output of query-11 (red box) as our image-based feature map.
  • Figure 4: Influence of the hyperparameters $\theta^{\text{P}}, \theta^{\text{I}}$ in Eq. \ref{eq:similarity_score} on the distribution of instance mask proposals, as captured by the volume occupied by instances (horizontal axis). $\theta^{\text{P}}_0 = 0.5, \theta^{\text{I}}_0 = 0.1$.
  • Figure 5: Comparative instance-based segmentation results on SemanticKITTI [behley2019semantickitti].
  • ...and 2 more figures