Approaching Outside: Scaling Unsupervised 3D Object Detection from 2D Scene

Ruiyang Zhang, Hu Zhang, Hang Yu, Zhedong Zheng

TL;DR

LiSe addresses unsupervised 3D object detection in sparse LiDAR environments by integrating LiDAR with rich 2D scene cues through a self-paced learning framework. It generates pseudo labels from both LiDAR (multi-traversal) and open-vocabulary 2D detectors with SAM, then refines them via a distance-aware fusion strategy and an adaptive sampling mechanism that emphasizes long-tail, distant, and small objects. A weak model aggregation scheme combines diverse snapshots into a robust final model, yielding substantial improvements over prior methods on nuScenes and Lyft, including long-range BEV performance that can surpass fully supervised baselines. The work demonstrates that open-vocabulary 2D priors, when carefully fused with LiDAR data and guided by self-paced learning, significantly enhance unsupervised 3D detection and offer practical benefits for robust autonomous navigation.

Abstract

Unsupervised 3D object detection aims to accurately detect objects in unstructured environments without explicit supervisory signals. Given sparse LiDAR point clouds, this task often yields compromised performance on distant or small objects because of the inherent sparsity and limited spatial resolution. In this paper, we make one of the early attempts to integrate LiDAR data with 2D images for unsupervised 3D detection and introduce a new method, dubbed LiDAR-2D Self-paced Learning (LiSe). We argue that RGB images serve as a valuable complement to LiDAR data, offering precise 2D localization cues, particularly when only a few LiDAR points are available for an object. Considering the unique characteristics of both modalities, our framework devises a self-paced learning pipeline that incorporates adaptive sampling and weak model aggregation strategies. The adaptive sampling strategy dynamically tunes the distribution of pseudo labels during training, countering the tendency of models to overfit to easily detected samples, such as nearby and large objects. By doing so, it ensures a balanced learning trajectory across varying object scales and distances. The weak model aggregation component consolidates the strengths of models trained under different pseudo label distributions, culminating in a robust final model. Experimental evaluations validate the efficacy of the proposed LiSe method, showing significant improvements of +7.1% AP$_{BEV}$ and +3.4% AP$_{3D}$ on nuScenes, and +8.3% AP$_{BEV}$ and +7.4% AP$_{3D}$ on Lyft compared to existing techniques.
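To make the two self-paced components described in the abstract more concrete, here is a minimal sketch of one round. It is an illustration under stated assumptions, not the authors' formulation: the function names `adaptive_sampling_rates` and `aggregate_weights`, the ratio-based rate rule, the `floor` and `cap` bounds, and the convex weight blend with `alpha` are all hypothetical choices made for this example.

```python
from typing import Dict, List


def adaptive_sampling_rates(
    label_counts: Dict[str, int],
    pred_counts: Dict[str, int],
    floor: float = 0.1,
    cap: float = 1.0,
) -> Dict[str, float]:
    """Return a keep-probability per bin for the next round's pseudo labels.

    Bins are hypothetical groupings of objects (e.g. by distance or size).
    Assumed rule: if the trained model detects a bin more often than its
    share among the pseudo labels, that bin is treated as "easy" and is
    down-sampled; under-detected bins keep the maximum rate.
    """
    total_labels = sum(label_counts.values())
    total_preds = sum(pred_counts.values())
    if total_labels == 0:
        raise ValueError("label_counts must contain at least one pseudo label")

    rates: Dict[str, float] = {}
    for bin_name, n_label in label_counts.items():
        label_share = n_label / total_labels
        pred_share = pred_counts.get(bin_name, 0) / total_preds if total_preds else 0.0
        # A bin the model never predicts is kept at the maximum rate.
        rate = cap if pred_share == 0.0 else label_share / pred_share
        rates[bin_name] = max(floor, min(cap, rate))
    return rates


def aggregate_weights(
    prev_aggregated: Dict[str, List[float]],
    new_model: Dict[str, List[float]],
    alpha: float = 0.5,
) -> Dict[str, List[float]]:
    """Blend a newly trained model into the previously aggregated one.

    Assumption: aggregation is a per-parameter convex combination, with
    `alpha` as the weight given to the new model.
    """
    if prev_aggregated.keys() != new_model.keys():
        raise ValueError("both models must have the same parameter names")
    return {
        name: [(1.0 - alpha) * p + alpha * n
               for p, n in zip(prev_aggregated[name], new_model[name])]
        for name in prev_aggregated
    }


# Toy round: the model over-detects "near" objects, so that bin is down-sampled.
rates = adaptive_sampling_rates({"near": 80, "far": 20}, {"near": 95, "far": 5})
merged = aggregate_weights({"w": [0.0, 2.0]}, {"w": [1.0, 4.0]}, alpha=0.5)
```

In a real pipeline the parameter dictionaries would be framework tensors rather than Python lists, and the bins would come from whatever distance and size metric the method actually uses; the sketch only shows the direction of the update.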

Paper Structure (11 sections, 4 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Typical limitations of LiDAR-based methods for unsupervised 3D object detection. The prevailing LiDAR-based method, MODEST [you2022learning], generally misses distant and small objects (left), whereas our proposed method LiSe successfully recalls them (right). Best viewed in color: green boxes are ground truth labels and red boxes are predictions.
  • Figure 2: Illustration of the pseudo label generation process in LiSe, which harnesses the information density of 2D scenes to complement LiDAR data. A generation method tailored to each modality produces LiDAR-based and image-based 3D boxes. In the LiDAR branch, an off-the-shelf multi-traversal method generates pseudo labels, primarily covering near-range objects. Concurrently, the image branch uses a pretrained open-vocabulary 2D detector and the Segment Anything Model (SAM) to generate 2D contours from images, which are then mapped into 3D space. A distance-aware 3D box integration step then fuses the boxes from both modalities. Notably, image-based boxes at longer ranges (e.g., > 10 m) are merged with the LiDAR-based 3D boxes. This integration addresses the limitations of the LiDAR-based method in detecting long-range and small objects. The resulting pseudo labels cover samples that are originally challenging for LiDAR-based methods (e.g., distant and small objects), laying a solid foundation for improving detection performance on these cases.
  • Figure 3: Illustration of the self-paced learning process in LiSe. The initial distribution of objects and the inference distribution after training are first calculated with the distance volume-based metric. The adaptive sampling strategy then updates the sampling rates for different objects based on the change between the two distributions. Weak model aggregation further combines the newly trained model with the previously aggregated model to obtain a stronger, more robust model for the current round. Finally, the distribution of pseudo labels and the model weights are updated iteratively for $T$ rounds to obtain the final model.
  • Figure 4: Statistical analysis of the performance of different models. (a) Comparison of the performance of various methods, showing the superiority of LiSe over purely LiDAR-based methods. (b) Performance changes throughout the training process. The trend shows that combining the adaptive sampling strategy with weak model aggregation ensures a stable and effective training process.
  • Figure 5: Visual comparison of MODEST [you2022learning], OYSTER [zhang2023towards], LiSe (ours), and ground truth boxes. Overall, the results indicate that LiSe is superior in detecting distant and small objects. Green boxes represent ground truth labels, red boxes indicate predictions, and blue circles highlight differences in predictions.
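The distance-aware box integration described in the Figure 2 caption can be sketched as follows. This is a simplified illustration, not the paper's implementation: the `Box3D` type, the `fuse_boxes` function, and the `duplicate_radius` check that skips image-based boxes sitting on top of an already kept box are assumptions made for this example. Only the idea of adding image-based boxes beyond a range threshold (the caption's example is 10 m) comes from the text.

```python
import math
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Box3D:
    """Hypothetical 3D box in the ego frame (meters, radians)."""
    x: float
    y: float
    z: float
    length: float
    width: float
    height: float
    yaw: float = 0.0
    source: str = "lidar"  # "lidar" or "image"


def bev_range(box: Box3D) -> float:
    """Horizontal distance from the sensor origin to the box center."""
    return math.hypot(box.x, box.y)


def fuse_boxes(
    lidar_boxes: List[Box3D],
    image_boxes: List[Box3D],
    min_image_range: float = 10.0,
    duplicate_radius: float = 1.0,
) -> List[Box3D]:
    """Keep all LiDAR-based boxes and add long-range image-based boxes.

    An image-based box is added only if it lies beyond `min_image_range`
    and its center is not within `duplicate_radius` (assumed value) of a
    box that has already been kept.
    """
    fused = list(lidar_boxes)
    for candidate in image_boxes:
        if bev_range(candidate) <= min_image_range:
            continue  # near range is left to the LiDAR branch
        is_duplicate = any(
            math.hypot(candidate.x - kept.x, candidate.y - kept.y) < duplicate_radius
            for kept in fused
        )
        if not is_duplicate:
            fused.append(candidate)
    return fused


lidar = [Box3D(5.0, 0.0, 0.0, 4.0, 2.0, 1.5), Box3D(30.0, 0.0, 0.0, 4.0, 2.0, 1.5)]
image = [
    Box3D(4.0, 1.0, 0.0, 1.0, 1.0, 1.7, source="image"),   # too close: skipped
    Box3D(30.2, 0.1, 0.0, 4.0, 2.0, 1.5, source="image"),  # overlaps a LiDAR box
    Box3D(40.0, 5.0, 0.0, 1.0, 1.0, 1.7, source="image"),  # distant and new: kept
]
fused = fuse_boxes(lidar, image)
```

A center-distance test is used here only to keep the example short; an overlap measure such as BEV IoU would be a natural alternative for deciding whether two boxes describe the same object.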