Temporal Overlapping Prediction: A Self-supervised Pre-training Method for LiDAR Moving Object Segmentation

Ziliang Miao, Runjian Chen, Yixi Cai, Buwei He, Wenquan Zhao, Wenqi Shao, Bo Zhang, Fu Zhang

TL;DR

This work tackles the labeling burden in LiDAR Moving Object Segmentation (MOS) by introducing Temporal Overlapping Prediction (TOP), a self-supervised pre-training method that leverages occupancy changes of temporal overlapping points observed across the current and adjacent LiDAR scans. TOP pre-trains a sparse 4D UNet encoder by predicting occupancy states of overlapping points and by reconstructing current scene occupancy, avoiding noisy flow learning inherent in forecasting approaches. Through extensive few-shot and cross-dataset experiments on nuScenes and SemanticKITTI, TOP consistently improves object-level Recall$_{\text{obj}}$ and, to a degree, IoU$_{\text{w/o}}$, demonstrating strong transferability across LiDAR setups and applicability to related tasks like semantic segmentation. The results underscore the method’s practical significance for robust dynamic object perception in autonomous systems, with potential extensions to other temporal perception tasks.

Abstract

Moving object segmentation (MOS) on LiDAR point clouds is crucial for autonomous systems like self-driving vehicles. Previous supervised approaches rely heavily on costly manual annotations, while LiDAR sequences naturally capture temporal motion cues that can be leveraged for self-supervised learning. In this paper, we propose Temporal Overlapping Prediction (TOP), a self-supervised pre-training method that alleviates the labeling burden for MOS. TOP explores the temporal overlapping points that are commonly observed by the current and adjacent scans, and learns spatiotemporal representations by predicting the occupancy states of these points. Moreover, we use current occupancy reconstruction as an auxiliary pre-training objective, which enhances the model's awareness of the current scene structure. We conduct extensive experiments and observe that the conventional metric, Intersection-over-Union (IoU), is strongly biased toward objects with more scanned points and may therefore neglect small or distant objects. To compensate for this bias, we introduce an additional metric, mIoU_obj, to evaluate object-level performance. Experiments on nuScenes and SemanticKITTI show that TOP outperforms both the supervised training-from-scratch baseline and other self-supervised pre-training baselines by up to 28.77% relative improvement, demonstrating strong transferability across LiDAR setups and generalization to other tasks. Code and pre-trained models will be made publicly available upon publication.
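
The bias of point-level IoU toward objects with many scanned points can be seen in a small toy example. The sketch below is illustrative only: the abstract does not define mIoU_obj, so the object-level score used here (`object_recall`, with its `min_frac` parameter) is a hypothetical stand-in, and the point counts are made up.

```python
import numpy as np


def point_iou(pred, gt):
    """Point-level IoU of the moving class over boolean arrays."""
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union else 1.0


def object_recall(pred, gt, obj_ids, min_frac=0.5):
    """Fraction of moving objects with at least min_frac of their points found.

    Hypothetical object-level score for illustration; it is NOT the
    paper's mIoU_obj, whose definition is not given in this section.
    """
    ids = np.unique(obj_ids[gt])
    hits = [pred[(obj_ids == i) & gt].mean() >= min_frac for i in ids]
    return float(np.mean(hits)) if hits else 1.0


# Toy scene: a nearby vehicle with 1000 points (id 1) and a distant
# pedestrian with 10 points (id 2). Both are moving.
obj_ids = np.concatenate([np.full(1000, 1), np.full(10, 2)])
gt = np.ones(1010, dtype=bool)

# The prediction finds the vehicle but misses the pedestrian entirely.
pred = gt.copy()
pred[obj_ids == 2] = False

iou = point_iou(pred, gt)               # 1000 / 1010, about 0.99
rec = object_recall(pred, gt, obj_ids)  # 1 of 2 objects found: 0.5
print(f"point IoU = {iou:.3f}, object-level recall = {rec:.2f}")
```

Missing the pedestrian barely moves the point-level IoU, while the object-level score drops to one half, which is the kind of discrepancy an object-level metric is meant to expose.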

Paper Structure

This paper contains 20 sections, 14 equations, 8 figures, 5 tables.

Figures (8)

  • Figure 1: Temporal occupancy inconsistency caused by the relative motion between the sensor and the person. The red dotted lines track fixed points (temporal overlapping points) in space across time, illustrating how their occupancy states change. The point colors denote the occupancy state, while the index indicates the observation time.
  • Figure 2: LiDAR beam divergence: The diverging beam is illustrated as the blue cone. The reported point (black) lies on the beam's centerline, while the actual beam hit-point (red) can be anywhere within the beam's footprint. LiDAR occupancy measurement: We model the space from the sensor to the reported LiDAR point as free with maximum confidence. Beyond this point, confidence decays exponentially. The region immediately following the reported point is considered occupied if its confidence is above a predefined threshold. The subsequent region where the confidence falls below the threshold is considered unknown.
  • Figure 3: Overall pipeline. Pre-processing: (a) shows the coplanarity condition of two beams. (b) If the spatial angle between two beams $\alpha_{i,j}$ exceeds the beam divergence angle $\theta_{\text{dvg}}$, we sample the beams' intersection point as the temporal overlapping point. The red and blue areas indicate beam divergence. (c) When $\alpha_{i,j}$ is less than $\theta_{\text{dvg}}$, we calculate the intersection segment and sample temporal overlapping points from it. Pre-training: (d) The input sequence is encoded by a sparse 4D UNet. For both pre-training objectives, a shallow MLP decoder predicts occupancy states based on the point's positional encoding and the feature of its corresponding LiDAR point.
  • Figure 4: Qualitative results of the nuScenes MOS.
  • Figure 5: Qualitative results of the SemanticKITTI MOS.
  • ...and 3 more figures
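
The occupancy model in the Figure 2 caption and the case split in Figure 3 (b) and (c) can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the decay rate `decay`, the confidence threshold `threshold`, and the example divergence angle are all hypothetical, and the captions do not say how the boundary cases (query exactly at the reported range, angle exactly equal to the divergence angle) are handled.

```python
import math


def occupancy_state(d, r, decay=1.0, threshold=0.5):
    """Occupancy state of a query at distance d along a beam with reported range r.

    Before the reported point the space is free. Beyond it, confidence
    decays exponentially; the state is occupied while confidence stays
    at or above the threshold and unknown afterwards.
    decay and threshold are illustrative values, not taken from the paper.
    """
    if d < r:
        return "free"
    confidence = math.exp(-decay * (d - r))
    return "occupied" if confidence >= threshold else "unknown"


def beam_angle(u, v):
    """Angle in radians between two beam direction vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    cos_angle = max(-1.0, min(1.0, dot / (norm_u * norm_v)))
    return math.acos(cos_angle)


def overlap_case(u, v, theta_dvg):
    """Choose between sampling an intersection point or an intersection segment.

    Follows the caption: a point when the angle exceeds theta_dvg, a
    segment otherwise (equality is assigned to the segment case here,
    which is an assumption).
    """
    if beam_angle(u, v) > theta_dvg:
        return "intersection point"
    return "intersection segment"


print(occupancy_state(5.0, 10.0))    # in front of the reported point
print(occupancy_state(10.2, 10.0))   # just behind it
print(occupancy_state(12.0, 10.0))   # far behind it
print(overlap_case((1, 0, 0), (0, 1, 0), theta_dvg=0.003))
print(overlap_case((1, 0, 0), (1, 0, 0), theta_dvg=0.003))
```

With the illustrative defaults, the occupied band extends ln(2)/decay, roughly 0.69 distance units, behind the reported point; a real setup would choose both values from the sensor's characteristics.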