LISO: Lidar-only Self-Supervised 3D Object Detection

Stefan Baur, Frank Moosmann, Andreas Geiger

TL;DR

LISO tackles lidar-only self-supervised 3D object detection by exploiting motion cues from unlabeled lidar sequences. It combines a self-supervised lidar scene flow network with a trajectory-regularized self-training loop to generate high-precision pseudo ground truth and iteratively refine a single-frame detector, which lets a detector supervised only by motion generalize beyond moving objects without cameras or GPS. Across four real-world datasets and two state-of-the-art (SOTA) detector architectures, LISO consistently outperforms unsupervised baselines and narrows the gap to supervised methods, while remaining robust to sensor and mounting variations. The approach reduces labeling requirements for lidar-only autonomy and shows strong cross-dataset generalization; the authors plan to release code for reproducibility.

Abstract

3D object detection is one of the most important components in any self-driving stack, but current state-of-the-art (SOTA) lidar object detectors require costly and slow manual annotation of 3D bounding boxes to perform well. Recently, several methods have emerged to generate pseudo ground truth without human supervision; however, all of them have drawbacks. Some require sensor rigs with full camera coverage and accurate calibration, partly supplemented by an auxiliary optical flow engine. Others require expensive high-precision localization to find objects that disappeared over multiple drives. We introduce a novel self-supervised method, which we call trajectory-regularized self-training, to train SOTA lidar object detection networks on unlabeled sequences of lidar point clouds only. It utilizes a SOTA self-supervised lidar scene flow network under the hood to generate, track, and iteratively refine pseudo ground truth. We demonstrate the effectiveness of our approach for multiple SOTA object detection networks across multiple real-world datasets. Code will be released.
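
The abstract does not spell out how scene flow turns into pseudo ground truth. As a rough illustration only, the sketch below assumes one plausible reading of the "generate" step: keep points whose ego-motion-compensated flow is large, group nearby moving points, and fit a box around each group. The function name `cluster_moving_points`, all thresholds, the simple region-growing grouping, and the axis-aligned bird's-eye-view boxes are assumptions made for this example, not details taken from the paper.

```python
import math
from collections import deque


def cluster_moving_points(points, flows, min_speed=0.5, radius=1.0, min_points=3):
    """Group moving points and fit axis-aligned bird's-eye-view boxes.

    points: list of (x, y) positions in metres.
    flows:  list of (dx, dy) ego-motion-compensated displacements, one per point.
    All thresholds are illustrative assumptions, not values from the paper.
    Returns a sorted list of boxes (x_min, y_min, x_max, y_max).
    """
    # Keep only points whose residual flow magnitude indicates motion.
    moving = [i for i, (dx, dy) in enumerate(flows) if math.hypot(dx, dy) >= min_speed]

    unvisited = set(moving)
    boxes = []
    while unvisited:
        seed = unvisited.pop()
        cluster, queue = [seed], deque([seed])
        # Region growing: pull in every unvisited moving point within `radius`.
        while queue:
            i = queue.popleft()
            near = [j for j in unvisited if math.dist(points[i], points[j]) <= radius]
            for j in near:
                unvisited.remove(j)
                cluster.append(j)
                queue.append(j)
        if len(cluster) < min_points:
            continue  # too few points to trust as an object
        xs = [points[i][0] for i in cluster]
        ys = [points[i][1] for i in cluster]
        boxes.append((min(xs), min(ys), max(xs), max(ys)))
    return sorted(boxes)


# Two moving groups, one isolated moving point, and one static point.
points = [(0, 0), (0.5, 0), (0.5, 0.5), (10, 10), (10.4, 10), (10.4, 10.3), (20, 20), (5, 5)]
flows = [(1, 0)] * 3 + [(0, 1)] * 3 + [(1, 1)] + [(0, 0)]
pseudo_boxes = cluster_moving_points(points, flows)
```

Boxes obtained this way cover only moving objects, which is why a later training stage on single frames is needed before static objects can be detected.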

Paper Structure (22 sections, 2 equations, 10 figures, 7 tables)

This paper contains 22 sections, 2 equations, 10 figures, 7 tables.

Figures (10)

  • Figure 1: Objects predicted by our method using no manual annotations. Red boxes are ground truth boxes, yellow boxes are predicted by our network.
  • Figure 1: Performance comparison of clustering ground truth lidar scene flow (top) and SLIM lidar scene flow (bottom) on the nuScenes dataset. The methods are evaluated according to the official nuScenes protocol on the validation split. The dashed line marks the minimum threshold of $0.1$ for precision and recall; all results below either threshold are discarded. This leads to the surprising effect that the AP score is higher when using SLIM lidar scene flow, but this is only a result of the clipping dictated by the nuScenes evaluation protocol.
  • Figure 2: Overview of the proposed method. Point cloud sequences are preprocessed (blue, Sec. \ref{sec:preprocessing}), initial pseudo ground truth is created (orange, Sec. \ref{sec:pgtgeneration}), and the object detector is iteratively trained and pseudo ground truth regenerated (red, Sec. \ref{sec:pgtgeneration}).
  • Figure 2: Precision and recall of the (tracked) pseudo ground truth generated by Oyster and LISO over the course of self-training of Centerpoint on WOD (training split). Precision and recall are computed as in the AP metrics used in Fig. \ref{fig:pgtNetQuality} and Table \ref{tab:waymo}, i.e. true positives are occurrences where the BEV IoU between ground truth and predicted boxes is greater than 0.4, but at a specific confidence threshold. For Oyster, we use the value reported in the Oyster publication, $c=0.4$. For LISO, we use $c=0.3$ and only discard the learned weights every other round, as stated in Section \ref{sec:pgtgeneration}. Note that the dip in Oyster's performance at round 1 stems from the zero-shot generalization, where the network is tasked to generalize from training on the initial pseudo ground truth generated on the smaller BEV range to the full, previously unseen BEV range, going from $50\times50\,\mathrm{m}$ to $100\times100\,\mathrm{m}$.
  • Figure 3: Overview of preprocessing, initial pseudo ground truth generation, and training, with examples. Top left: In the first step, self-supervised lidar scene flow is computed and corrected for vehicle ego-motion. Points are colored by flow direction and magnitude. Top right: In the second step, the scene flow is clustered and bounding boxes are fitted (to the moving objects). Bottom left: In the third step, the network is trained on the pseudo ground truth and also generalizes to static objects, since it does not have the motion information as an input signal. Points are thus colored by laser intensity. Bottom right: Ground truth, for reference.
  • ...and 5 more figures
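
The precision and recall caption above counts a prediction as a true positive when its BEV IoU with a ground truth box exceeds 0.4, evaluated at a fixed confidence threshold. The sketch below illustrates that bookkeeping under simplifying assumptions: boxes are axis-aligned `(x_min, y_min, x_max, y_max)` tuples rather than rotated boxes, and matching is a greedy one-to-one assignment in descending confidence order. The helper names `bev_iou` and `precision_recall` are hypothetical and do not come from the paper or any evaluation toolkit.

```python
def bev_iou(a, b):
    """IoU of two axis-aligned BEV boxes given as (x_min, y_min, x_max, y_max)."""
    iw = min(a[2], b[2]) - max(a[0], b[0])
    ih = min(a[3], b[3]) - max(a[1], b[1])
    if iw <= 0 or ih <= 0:
        return 0.0  # boxes do not overlap
    inter = iw * ih
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)


def precision_recall(predictions, ground_truth, conf_threshold=0.3, iou_threshold=0.4):
    """Precision and recall at one confidence threshold.

    predictions:  list of (box, confidence) pairs.
    ground_truth: list of boxes.
    Greedy one-to-one matching is an assumption made for this sketch.
    """
    # Discard low-confidence boxes, then process the rest best-first.
    kept = sorted(
        (p for p in predictions if p[1] >= conf_threshold),
        key=lambda p: p[1],
        reverse=True,
    )
    unmatched = list(ground_truth)
    true_positives = 0
    for box, _ in kept:
        best = max(unmatched, key=lambda g: bev_iou(box, g), default=None)
        if best is not None and bev_iou(box, best) > iou_threshold:
            unmatched.remove(best)  # each ground truth box matches at most once
            true_positives += 1
    precision = true_positives / len(kept) if kept else 0.0
    recall = true_positives / len(ground_truth) if ground_truth else 0.0
    return precision, recall


gt = [(0, 0, 2, 4), (10, 10, 12, 14)]
preds = [
    ((0, 0, 2, 3), 0.9),      # IoU 0.75 with the first ground truth box
    ((10, 12, 12, 16), 0.5),  # IoU 1/3 with the second box, below 0.4
    ((20, 20, 22, 24), 0.2),  # dropped by the confidence threshold
]
precision, recall = precision_recall(preds, gt, conf_threshold=0.3)
```

Lowering `conf_threshold` keeps more boxes, which can raise recall but usually lowers precision; this is why the caption fixes one threshold per method when comparing rounds of self-training.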