Unsupervised Online 3D Instance Segmentation with Synthetic Sequences and Dynamic Loss

Yifan Zhang, Wei Zhang, Chuangxin He, Zhonghua Miao, Junhui Hou

TL;DR

This work tackles unsupervised online 3D instance segmentation of sequential LiDAR data, where no labels are available and object identities must stay consistent over time. It proposes a framework that combines spatio-temporal pseudo-labeling, synthetic point cloud sequence generation, and an online auto-regressive segmenter trained with unsupervised losses, including a time-consistency term and dynamic sample weighting. On SemanticKITTI, nuScenes, and PandaSet, the approach improves over UNIT and other baselines, particularly in temporal association and Best IoU, while preserving real-time inference. Overall, the method improves robustness, generalization, and efficiency for online 3D instance segmentation in dynamic environments.

Abstract

Unsupervised online 3D instance segmentation is a fundamental yet challenging task, as it requires maintaining consistent object identities across LiDAR scans without relying on annotated training data. Existing methods, such as UNIT, have made progress in this direction but remain constrained by limited training diversity, rigid temporal sampling, and heavy dependence on noisy pseudo-labels. We propose a new framework that enriches the training distribution through synthetic point cloud sequence generation, enabling greater diversity without relying on manual labels or simulation engines. To better capture temporal dynamics, our method incorporates a flexible sampling strategy that leverages both adjacent and non-adjacent frames, allowing the model to learn from long-range dependencies as well as short-term variations. In addition, a dynamic-weighting loss emphasizes confident and informative samples, guiding the network toward more robust representations. Through extensive experiments on SemanticKITTI, nuScenes, and PandaSet, our method consistently outperforms UNIT and other unsupervised baselines, achieving higher segmentation accuracy and more robust temporal associations. The code will be publicly available at github.com/Eaphan/SFT3D.
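The flexible sampling idea in the abstract (training on both adjacent and non-adjacent frames) can be illustrated with a minimal sketch. The function name `sample_frame_pair`, the `max_gap` parameter, and the uniform choice of gap are assumptions made for illustration; the abstract does not specify how the gap is actually drawn.

```python
import random
from typing import Optional, Tuple


def sample_frame_pair(
    num_frames: int,
    max_gap: int = 5,
    rng: Optional[random.Random] = None,
) -> Tuple[int, int]:
    """Return indices (t, t + gap) of two scans from one LiDAR sequence.

    gap == 1 yields adjacent frames; gap > 1 yields non-adjacent frames.
    max_gap and the uniform gap distribution are illustrative assumptions.
    """
    if num_frames < 2:
        raise ValueError("need at least two frames to form a pair")
    if max_gap < 1:
        raise ValueError("max_gap must be at least 1")
    rng = rng or random.Random()
    # Clamp the gap so the second index stays inside the sequence.
    largest_gap = min(max_gap, num_frames - 1)
    gap = rng.randint(1, largest_gap)
    # Pick the first frame so that t + gap is still a valid index.
    t = rng.randint(0, num_frames - 1 - gap)
    return t, t + gap
```

Calling `sample_frame_pair(len(sequence))` once per training step would mix short-term pairs (gap of 1) with longer-range pairs, which is the behavior the abstract contrasts with rigid temporal sampling.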

Paper Structure

This paper contains 22 sections, 14 equations, 6 figures, 9 tables, 1 algorithm.

Figures (6)

  • Figure 1: The overall framework. (a) We obtain initial pseudo-labels with spatial-temporal clustering. (b) We synthesize the new point cloud sequence for training. (c) We train the network with the pseudo-labels. The samples are flexibly taken from a point cloud sequence, and the losses are adjusted with dynamic weights.
  • Figure 2: Illustration of the point cloud sequence synthesis process. (a) Input LiDAR point cloud. (b) Ground points extracted and retained. (c) ValidMap construction, where valid placement regions are highlighted in red. (d) Final synthesized scene with objects placed into valid regions, resulting in augmented LiDAR point clouds.
  • Figure 3: Visual results on SemanticKITTI. From left to right: ground-truth labels (GT), pseudo-labels, predictions from UNIT, and results from our method. Our approach produces cleaner and more consistent instance segmentation, especially in challenging regions (errors from UNIT are highlighted with red circles). All results are class-agnostic, where colors indicate different instances.
  • Figure 4: Visualization of the (a) confidence-based scaling factor and (b) motion-based weight vector for training.
  • Figure 5: Visual results on the nuScenes dataset. Our method produces more consistent instance masks than the baseline UNIT. Best viewed zoomed in.
  • ...and 1 more figure
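The synthesis steps in the Figure 2 caption (keep ground points, build a ValidMap, place objects into valid regions) can be sketched roughly as follows. This is not the authors' implementation: the bird's-eye-view grid, the `cell_size` parameter, the rule that a cell is valid when it holds ground points and no non-ground points, and the helper names `build_valid_map` and `can_place` are all assumptions chosen for illustration.

```python
import math
from typing import Iterable, Set, Tuple

Point = Tuple[float, float, float]  # (x, y, z) in the sensor frame
Cell = Tuple[int, int]              # bird's-eye-view grid index


def to_cell(x: float, y: float, cell_size: float) -> Cell:
    """Map an (x, y) position to its grid cell."""
    return (math.floor(x / cell_size), math.floor(y / cell_size))


def build_valid_map(
    ground_points: Iterable[Point],
    obstacle_points: Iterable[Point],
    cell_size: float = 0.5,
) -> Set[Cell]:
    """Assumed rule: a cell is valid if it has ground and no obstacle points."""
    ground_cells = {to_cell(x, y, cell_size) for x, y, _ in ground_points}
    occupied_cells = {to_cell(x, y, cell_size) for x, y, _ in obstacle_points}
    return ground_cells - occupied_cells


def can_place(
    object_points: Iterable[Point],
    offset_xy: Tuple[float, float],
    valid_map: Set[Cell],
    cell_size: float = 0.5,
) -> bool:
    """Check that the shifted object footprint lies entirely on valid cells."""
    dx, dy = offset_xy
    footprint = {to_cell(x + dx, y + dy, cell_size) for x, y, _ in object_points}
    return bool(footprint) and footprint <= valid_map
```

Under these assumptions, a synthesis loop would draw candidate offsets, keep the first one for which `can_place` returns true, and append the shifted object points to the scan.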