Shelf-Supervised Cross-Modal Pre-Training for 3D Object Detection

Mehar Khurana, Neehar Peri, James Hays, Deva Ramanan

TL;DR

This work proposes a shelf-supervised approach (i.e., supervised with off-the-shelf image foundation models) for generating zero-shot 3D bounding boxes from paired RGB and LiDAR data, and shows that image-based shelf-supervision is helpful for training LiDAR-only, RGB-only, and multi-modal (RGB + LiDAR) detectors.

Abstract

State-of-the-art 3D object detectors are often trained on massive labeled datasets. However, annotating 3D bounding boxes remains prohibitively expensive and time-consuming, particularly for LiDAR. Instead, recent works demonstrate that self-supervised pre-training with unlabeled data can improve detection accuracy with limited labels. Contemporary methods adapt best practices for self-supervised learning from the image domain (such as contrastive learning) to point clouds. However, publicly available 3D datasets are considerably smaller and less diverse than those used for image-based self-supervised learning, limiting the effectiveness of these methods. We do note, however, that such 3D data is naturally collected in a multimodal fashion, often paired with images. Rather than pre-training with only self-supervised objectives, we argue that it is better to bootstrap point cloud representations using image-based foundation models trained on internet-scale data. Specifically, we propose a shelf-supervised approach (i.e., supervised with off-the-shelf image foundation models) for generating zero-shot 3D bounding boxes from paired RGB and LiDAR data. Pre-training 3D detectors with such pseudo-labels yields significantly better semi-supervised detection accuracy than prior self-supervised pretext tasks. Importantly, we show that image-based shelf-supervision is helpful for training LiDAR-only, RGB-only, and multi-modal (RGB + LiDAR) detectors. We demonstrate the effectiveness of our approach on nuScenes and the Waymo Open Dataset (WOD), significantly improving over prior work in limited-data settings. Our code is available at https://github.com/meharkhurana03/cm3d

Paper Structure (18 sections, 9 figures, 12 tables)

Figures (9)

  • Figure 1: Cross-Modal 3D Detection Distillation with Vision-Language Models. Existing datasets used for 3D representation learning are considerably smaller and less diverse than those used for image-based self-supervised learning, limiting the effectiveness of pre-training. Therefore, we propose a simple approach for transferring object-centric priors from vision-language models to LiDAR. Specifically, we project 3D LiDAR points onto 2D instance segmentation masks to generate zero-shot 3D bounding boxes [wilson20203d] that can be used for pre-training 3D detectors.
  • Figure 2: Overview. Given unlabeled, calibrated, and paired RGB images and LiDAR sweeps, we generate 3D bounding box pseudo-labels using foundational 2D priors from VLMs, map priors, and geometric clustering. We describe our unprojection step further in Fig. 3. Using these pseudo-labels, we can train LiDAR-only, RGB-only, or multi-modal 3D detectors.
  • Figure 3: Unprojecting 2D Foundational Priors to 3D. First, we prompt an open-vocabulary 2D detector (e.g., Detic [zhou2022detecting]) with a class name (e.g., car) to generate 2D box proposals. Next, we prompt SAM [kirillov2023segment] with the predicted 2D bounding boxes to generate high-quality instance segmentation masks. We then generate an oriented 3D cuboid using the set of LiDAR points that project to a given 2D instance mask. Specifically, we define the center of the cuboid to be the medoid of the LiDAR points, the dimensions (length, width, height) to be a fixed shape prior (similar to an anchor box) as reported by ChatGPT when prompted with the class name, and the orientation to be aligned with lane geometry provided from an HD map.
  • Figure 4: Qualitative Results of Pseudo-Labels. We visualize pseudo-labels (pink) and ground-truth labels (green) across all 10 object classes on the nuScenes val-set. In (a), our pseudo-labels accurately estimate location, cuboid size, and orientation, demonstrating the general effectiveness of medoid compensation and map-based orientation estimation. In (b), we find that CM3D often misses heavily occluded objects. This is unsurprising because our method relies on accurate RGB-based detections, which often fail under heavy occlusion. In (c), our map-based orientation estimation fails when the predicted object is not oriented in the direction of any lane. For example, the incorrect orientation of the car turning into the intersection (not aligned to any nearby lanes) illustrates the limitations of our approach. In both (d) and (f), we are unable to label several barriers. We attribute these missed detections to the ambiguity of the class name barrier. Notably, a barrier in nuScenes may not be the same as a barrier as defined in internet pre-training data [madan2023revisiting]. In (d), (e), and (f), we produce duplicate boxes for the same instances, indicating a failure of NMS.
  • Figure 5: More Qualitative Results. We present additional qualitative results comparing the predictions from BEVFusion trained from scratch on 5% data (top), BEVFusion + CM3D (middle), and BEVFusion + CM3D w/ Self-Training (bottom). Ground-truth bounding boxes are shown in green, and predictions are shown in blue. Across all three examples, we find that the model trained from scratch produces many high-confidence false positives. Pre-training BEVFusion with CM3D pseudo-labels improves performance by reducing the number of false positives. However, many of the predictions still have incorrect orientation estimates. Lastly, we find that self-training improves orientation estimation.
  • ...and 4 more figures
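
The unprojection step described in the Figure 3 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation (see the linked repository for that). It assumes a simple pinhole camera with intrinsics `K`, a 4x4 LiDAR-to-camera transform `T_cam_from_lidar`, a boolean instance mask already produced by a 2D segmenter, a per-class size prior passed in by the caller, and lane centerlines given as sampled points with headings. All function and parameter names are hypothetical, and the medoid compensation mentioned in the Figure 4 caption is omitted.

```python
import numpy as np


def points_in_mask(points_xyz, mask, K, T_cam_from_lidar):
    """Return the LiDAR points whose image projection falls inside a 2D mask.

    points_xyz: (N, 3) points in the LiDAR frame.
    mask: (H, W) boolean instance mask.
    K: (3, 3) pinhole intrinsics (assumed, no lens distortion).
    T_cam_from_lidar: (4, 4) rigid transform from LiDAR to camera frame.
    """
    n = points_xyz.shape[0]
    homo = np.hstack([points_xyz, np.ones((n, 1))])
    cam = (T_cam_from_lidar @ homo.T).T[:, :3]
    depth = cam[:, 2]
    in_front = depth > 1e-6
    # Replace non-positive depths with 1.0 so the division is safe;
    # those points are discarded by the in_front test below.
    safe_depth = np.where(in_front, depth, 1.0)
    uv = (K @ cam.T).T[:, :2] / safe_depth[:, None]
    cols = np.floor(uv[:, 0]).astype(int)
    rows = np.floor(uv[:, 1]).astype(int)
    h, w = mask.shape
    valid = in_front & (cols >= 0) & (cols < w) & (rows >= 0) & (rows < h)
    inside = np.zeros(n, dtype=bool)
    inside[valid] = mask[rows[valid], cols[valid]]
    return points_xyz[inside]


def medoid(points):
    """Return the point with the smallest summed distance to all others."""
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    return points[np.argmin(dists.sum(axis=1))]


def nearest_lane_yaw(center_xy, lane_xy, lane_yaws):
    """Return the heading of the lane sample closest to the box center."""
    idx = np.argmin(np.linalg.norm(lane_xy - center_xy, axis=1))
    return float(lane_yaws[idx])


def make_pseudo_label(points_xyz, mask, K, T_cam_from_lidar,
                      size_lwh, lane_xy, lane_yaws):
    """Build one cuboid: medoid center, fixed size prior, lane-aligned yaw."""
    pts = points_in_mask(points_xyz, mask, K, T_cam_from_lidar)
    if len(pts) == 0:
        return None  # no LiDAR support for this mask
    center = medoid(pts)
    yaw = nearest_lane_yaw(center[:2], lane_xy, lane_yaws)
    return {"center": center,
            "size_lwh": np.asarray(size_lwh, dtype=float),
            "yaw": yaw}
```

The medoid is used here rather than the mean because it is always one of the observed points, which makes it less sensitive to stray background points that leak through an imperfect mask. The pairwise-distance medoid is quadratic in the number of points, which is acceptable for the few hundred points that typically fall inside a single object mask.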