Self-Supervised Pre-training with Combined Datasets for 3D Perception in Autonomous Driving

Shumin Wang, Zhuoran Yang, Lidian Wang, Zhipeng Tang, Heng Li, Lehan Pan, Sha Zhang, Jie Peng, Jianmin Ji, Yanyong Zhang

TL;DR

The paper tackles data scarcity and domain bias in 3D perception for autonomous driving by pre-training LiDAR-camera fusion models on large-scale, unlabeled, heterogeneous datasets. It proposes a self-supervised framework that jointly learns image and point-cloud BEV representations from scratch, using a BEV-based contrastive loss $L_{CL}$ and an image MAE loss $L_{MAE}$, combined as $L_{All} = L_{MAE} + L_{CL}$, along with dataset-specific prompt adapters to mitigate cross-dataset bias. The method yields improvements across four downstream tasks (3D object detection, 3D object tracking, BEV segmentation, and occupancy prediction) and demonstrates robustness to domain shifts (e.g., nuScenes to Waymo) and scalability up to 250k frames. These results suggest that unlabeled, diverse data can substantially scale 3D perception models and move them toward foundation-model-like capabilities in autonomous driving.

Abstract

The significant achievements of pre-trained models leveraging large volumes of data in NLP and 2D vision inspire us to explore the potential of extensive data pre-training for 3D perception in autonomous driving. Toward this goal, this paper proposes to utilize massive unlabeled data from heterogeneous datasets to pre-train 3D perception models. We introduce a self-supervised pre-training framework that learns effective 3D representations from scratch on unlabeled data, combined with a prompt-adapter-based domain adaptation strategy to reduce dataset bias. The approach significantly improves model performance on downstream tasks such as 3D object detection, BEV segmentation, 3D object tracking, and occupancy prediction, and shows a steady performance increase as the training data volume scales up, demonstrating the potential to continually benefit 3D perception models for autonomous driving. We will release the source code to inspire further investigations in the community.
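
The combined pre-training objective stated in the TL;DR, $L_{All} = L_{MAE} + L_{CL}$, can be illustrated with a minimal NumPy sketch. The text here does not give the exact form of either term, so the symmetric InfoNCE formulation, the per-cell pairing of image and point-cloud BEV features, the temperature value, and all function names below are assumptions made for illustration, not the authors' implementation.

```python
import numpy as np

def bev_contrastive_loss(img_bev, pts_bev, temperature=0.07):
    """Assumed InfoNCE-style loss between two BEV maps.

    img_bev, pts_bev: (N, C) arrays, one row per BEV cell. Row i of both
    arrays describes the same cell, so the pair (i, i) is the positive.
    """
    eps = 1e-12
    img = img_bev / (np.linalg.norm(img_bev, axis=1, keepdims=True) + eps)
    pts = pts_bev / (np.linalg.norm(pts_bev, axis=1, keepdims=True) + eps)
    logits = img @ pts.T / temperature  # (N, N) scaled cosine similarities

    def diag_cross_entropy(l):
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_prob = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_prob))

    # Average the image-to-points and points-to-image directions.
    return 0.5 * (diag_cross_entropy(logits) + diag_cross_entropy(logits.T))

def mae_loss(pred_patches, target_patches, mask):
    """Mean squared error over masked image patches only.

    pred_patches, target_patches: (P, D) arrays of flattened patches.
    mask: (P,) boolean array, True where the patch was masked out.
    """
    per_patch = ((pred_patches - target_patches) ** 2).mean(axis=1)
    return per_patch[mask].mean()

def total_loss(img_bev, pts_bev, pred_patches, target_patches, mask):
    """L_All = L_MAE + L_CL, with equal weights as written in the TL;DR."""
    return (mae_loss(pred_patches, target_patches, mask)
            + bev_contrastive_loss(img_bev, pts_bev))
```

In this sketch the contrastive term is small when the two modalities agree cell by cell, and the reconstruction term ignores patches the encoder was allowed to see.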

Paper Structure

This paper contains 36 sections, 6 equations, 4 figures, 8 tables.

Figures (4)

  • Figure 1: The pre-train-then-fine-tune framework for multi-modal 3D perception, integrating image and point cloud data. The point clouds and partially masked images are encoded into BEV feature tokens. A contrastive loss between the BEV maps of the two modalities makes them co-evolve, while an MAE loss for recovering the masked portions of the images helps capture their semantic features.
  • Figure 2: Multi-dataset training strategy with prompt adapters. We set tunable prompts for each dataset and mix the datasets during training. The prompts are injected into the backbones with MLP adapters.
  • Figure 3: BEV heatmaps of the image modality when testing the models by applying various prompts or using the model without prompt training. The x-axis represents the forward and backward direction of the vehicle. The data frame is from the Lyft dataset.
  • Figure 4: BEV heatmaps of image modality when testing the models by applying various prompts or using the model without prompt training.
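
The prompt-adapter strategy described in the Figure 2 caption (one tunable prompt per dataset, injected into the backbone through an MLP adapter) can be sketched roughly as follows. The caption does not say how the adapter output is combined with the backbone features, so the additive injection, the two-layer ReLU MLP, the dimensions, and the class and method names are all hypothetical choices for illustration only.

```python
import numpy as np

class PromptAdapter:
    """Hypothetical per-dataset prompt with a shared MLP adapter."""

    def __init__(self, dataset_names, prompt_dim=8, feat_dim=16,
                 hidden_dim=32, seed=0):
        rng = np.random.default_rng(seed)
        # One tunable prompt vector per dataset (trained in practice).
        self.prompts = {name: rng.normal(size=prompt_dim)
                        for name in dataset_names}
        # Shared two-layer MLP mapping a prompt into feature space.
        self.w1 = rng.normal(scale=0.1, size=(prompt_dim, hidden_dim))
        self.b1 = np.zeros(hidden_dim)
        self.w2 = rng.normal(scale=0.1, size=(hidden_dim, feat_dim))
        self.b2 = np.zeros(feat_dim)

    def offset(self, dataset_name):
        """Run the dataset's prompt through the MLP adapter."""
        prompt = self.prompts[dataset_name]
        hidden = np.maximum(prompt @ self.w1 + self.b1, 0.0)  # ReLU
        return hidden @ self.w2 + self.b2  # shape (feat_dim,)

    def __call__(self, tokens, dataset_name):
        """Inject the prompt by adding its offset to every token.

        tokens: (T, feat_dim) backbone features for one sample.
        Additive injection is an assumption of this sketch.
        """
        return tokens + self.offset(dataset_name)

# Usage: samples from mixed datasets carry their dataset name, which
# selects the prompt applied to that sample's features.
adapter = PromptAdapter(["nuscenes", "waymo", "lyft"])
tokens = np.zeros((4, 16))
lyft_tokens = adapter(tokens, "lyft")
```

Keeping the MLP shared and only the prompt vectors dataset-specific is one way to let a single backbone absorb dataset bias through a small number of extra parameters; other designs are possible.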