Cross-Modal Self-Supervised Learning with Effective Contrastive Units for LiDAR Point Clouds

Mu Cai, Chenxu Luo, Yong Jae Lee, Xiaodong Yang

TL;DR

This work tackles the high labeling cost of LiDAR data for autonomous driving by proposing a cross-modal self-supervised learning framework that aligns LiDAR point clouds with synchronized images. It introduces instance-aware clustering to form semantically meaningful contrastive units and similarity-balanced sampling to curb negative pairs that share similar semantics, all guided by frozen image features as semantic anchors. The learning objective is an InfoNCE-based cross-modal loss that jointly aligns point and image representations for each instance-aware unit. Extensive experiments across four benchmarks demonstrate robust gains in 3D object detection and semantic segmentation, with strong data-efficient and transfer-learning performance, establishing cross-modal SSL as a highly effective paradigm for self-driving perception. These results highlight the practical impact of leveraging multi-sensor information to reduce labeling effort and improve downstream perception tasks in real-world driving scenarios, while providing reusable code and models for the community.
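As a rough illustration of the InfoNCE-style cross-modal objective mentioned above, the following NumPy sketch scores each contrastive unit's point feature against the image features of all units and treats the unit's own image feature as the positive. The function name `cross_modal_info_nce`, the temperature value, and the one-directional (point-to-image) form are assumptions made for illustration; the summary does not specify these details, so this is not the authors' exact loss.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products are cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def cross_modal_info_nce(point_feats, image_feats, temperature=0.07):
    """Point-to-image InfoNCE over N contrastive units (hypothetical sketch).

    point_feats: (N, D) pooled point features, one row per unit.
    image_feats: (N, D) frozen image features for the same units.
    Row i of each array is assumed to describe the same unit (the positive
    pair); every other row acts as a negative. temperature is an assumed
    hyperparameter, not a value taken from the paper.
    """
    p = l2_normalize(np.asarray(point_feats, dtype=np.float64))
    q = l2_normalize(np.asarray(image_feats, dtype=np.float64))
    logits = p @ q.T / temperature                       # (N, N) similarities
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))            # positives on diagonal


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    img = rng.normal(size=(8, 16))
    pts = img + 0.01 * rng.normal(size=(8, 16))  # nearly aligned features
    print("aligned loss:   ", cross_modal_info_nce(pts, img))
    print("misaligned loss:", cross_modal_info_nce(pts, np.roll(img, 1, axis=0)))
```

Because the image features are described as frozen, in a real training setup only the point branch would receive gradients; the sketch computes the loss value only.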

Abstract

3D perception in LiDAR point clouds is crucial for a self-driving vehicle to act properly in a 3D environment. However, manually labeling point clouds is hard and costly. There has been growing interest in self-supervised pre-training of 3D perception models. Following the success of contrastive learning on images, current methods mostly conduct contrastive pre-training on point clouds only. Yet an autonomous vehicle is typically equipped with multiple sensors, including cameras and LiDAR. In this context, we systematically study single-modality, cross-modality, and multi-modality contrastive learning of point clouds, and show that cross-modality outperforms the alternatives. In addition, considering the huge difference between the training sources in 2D images and 3D point clouds, it remains unclear how to design more effective contrastive units for LiDAR. We therefore propose instance-aware and similarity-balanced contrastive units tailored for self-driving point clouds. Extensive experiments show that our approach achieves remarkable performance gains over various point cloud models across the downstream perception tasks of LiDAR-based 3D object detection and 3D semantic segmentation on four popular benchmarks: Waymo Open Dataset, nuScenes, SemanticKITTI, and ONCE.

Paper Structure

This paper contains 12 sections, 1 equation, 5 figures, 11 tables.

Figures (5)

  • Figure 1: (a) Our approach achieves consistent and significant performance gains compared to training from scratch and other state-of-the-art self-supervised learning methods for LiDAR point clouds across different fractions of fine-tuning data on Waymo Open Dataset. (b) Our comprehensive modality study finds that cross-modality (ours) is superior to single modality (and its enhanced version +) and multi-modality in terms of downstream performance and GPU memory consumption (proportional to bubble area), while requiring moderate pre-training time.
  • Figure 2: Illustration of the single modality, cross-modality, and multi-modality for contrastive learning of LiDAR point clouds. PC1 and PC2 denote two independently augmented point clouds.
  • Figure 3: Overview of the proposed cross-modal contrastive pre-training framework. We uniformly sample initial contrastive units to maximally cover the point cloud scene. An unsupervised geometric clustering step then generates the instance-aware contrastive units. Leveraging image features that are pre-trained with self-supervision and carry rich semantics, we develop similarity-balanced sampling, which balances the contrastive objective by ruling out units that are semantically close.
  • Figure 4: Illustration of instances such as vehicles and pedestrians discovered by the unsupervised clustering. Note that some instances are missing because the simple rule-based clustering is imperfect.
  • Figure 5: Comparison of the contrastive accuracy of different modalities. A contrastive unit is marked as a correct contrastive classification if its similarity with its positive sample is higher than its similarity with every negative sample.
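
The contrastive accuracy described in the Figure 5 caption can be sketched as follows. This is a minimal illustration under stated assumptions: the function name `contrastive_accuracy`, the use of cosine similarity, and the convention that row i of both arrays forms the positive pair are all hypothetical choices, since the caption only specifies the "positive beats all negatives" criterion.

```python
import numpy as np


def contrastive_accuracy(unit_feats, counterpart_feats, eps=1e-12):
    """Fraction of units whose positive is more similar than every negative.

    unit_feats:        (N, D) features of the contrastive units.
    counterpart_feats: (N, D) features of the paired samples, e.g. image
                       features or a second augmented point cloud. Row i is
                       assumed to be the positive of unit i; all other rows
                       are its negatives.
    """
    a = np.asarray(unit_feats, dtype=np.float64)
    b = np.asarray(counterpart_feats, dtype=np.float64)
    a = a / np.maximum(np.linalg.norm(a, axis=1, keepdims=True), eps)
    b = b / np.maximum(np.linalg.norm(b, axis=1, keepdims=True), eps)
    sim = a @ b.T                          # (N, N) cosine similarities
    positive = np.diag(sim).copy()         # similarity with the positive
    negatives = sim.copy()
    np.fill_diagonal(negatives, -np.inf)   # mask the positive out of each row
    hardest_negative = negatives.max(axis=1)
    return float(np.mean(positive > hardest_negative))


if __name__ == "__main__":
    feats = np.eye(4)
    swapped = feats[[1, 0, 2, 3]]          # units 0 and 1 get wrong partners
    print(contrastive_accuracy(feats, feats))
    print(contrastive_accuracy(feats, swapped))
```

In the usage example, the perfectly matched pairs score 1.0, while swapping the partners of two of the four units drops the accuracy to 0.5.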