Self-Supervised Pre-training with Combined Datasets for 3D Perception in Autonomous Driving
Shumin Wang, Zhuoran Yang, Lidian Wang, Zhipeng Tang, Heng Li, Lehan Pan, Sha Zhang, Jie Peng, Jianmin Ji, Yanyong Zhang
TL;DR
The paper tackles data scarcity and domain bias in 3D perception for autonomous driving by pre-training LiDAR-camera fusion models on large-scale, unlabeled, heterogeneous datasets. It proposes a self-supervised framework that jointly learns image and point-cloud BEV representations from scratch, using a BEV-based contrastive loss $L_{CL}$ and an image MAE loss $L_{MAE}$, combined as $L_{All} = L_{MAE} + L_{CL}$, together with dataset-specific prompt adapters that mitigate cross-dataset bias. The method improves four downstream tasks (3D object detection, 3D object tracking, BEV segmentation, and occupancy prediction), remains robust under domain shift (e.g., nuScenes to Waymo), and scales up to 250k frames. These results suggest that unlabeled, diverse data can substantially scale 3D perception models and move them toward foundation-model-like capabilities in autonomous driving.
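To make the combined objective $L_{All} = L_{MAE} + L_{CL}$ concrete, here is a minimal pure-Python sketch. It is not the authors' implementation: the summary does not give the exact form of either term, so the sketch assumes a mean-squared-error reconstruction over masked image patches for $L_{MAE}$ and an InfoNCE-style loss between matching image-BEV and point-cloud-BEV cells for $L_{CL}$. All function names, the `temperature` parameter, and the toy feature shapes are hypothetical.

```python
import math

def mae_loss(pred, target, mask):
    """Assumed L_MAE: mean squared error over masked patches only.

    pred, target: lists of patches, each a list of floats.
    mask: list of 0/1 flags; 1 means the patch was hidden from the encoder.
    """
    total, count = 0.0, 0
    for p, t, m in zip(pred, target, mask):
        if m:
            total += sum((pi - ti) ** 2 for pi, ti in zip(p, t)) / len(p)
            count += 1
    return total / count if count else 0.0

def _normalize(v):
    """Scale a vector to unit L2 norm (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def bev_contrastive_loss(img_bev, pts_bev, temperature=0.1):
    """Assumed L_CL: InfoNCE over BEV cells.

    Cell i of the image BEV is the positive for cell i of the point-cloud
    BEV; every other point-cloud cell acts as a negative.
    """
    img = [_normalize(v) for v in img_bev]
    pts = [_normalize(v) for v in pts_bev]
    loss = 0.0
    for i, query in enumerate(img):
        logits = [sum(a * b for a, b in zip(query, key)) / temperature
                  for key in pts]
        peak = max(logits)  # subtract the max for numerical stability
        log_denominator = peak + math.log(sum(math.exp(l - peak) for l in logits))
        loss += log_denominator - logits[i]
    return loss / len(img)

def total_loss(pred, target, mask, img_bev, pts_bev, temperature=0.1):
    """L_All = L_MAE + L_CL, an unweighted sum as stated in the summary."""
    return mae_loss(pred, target, mask) + bev_contrastive_loss(
        img_bev, pts_bev, temperature)
```

Under these assumptions, BEV features that agree across the two modalities give a contrastive term near zero, while mismatched cells are penalized heavily. The reconstruction term depends only on the masked patches.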
Abstract
The significant achievements of pre-trained models that leverage large volumes of data in NLP and 2D vision inspire us to explore the potential of large-scale data pre-training for 3D perception in autonomous driving. Toward this goal, this paper proposes to use massive unlabeled data from heterogeneous datasets to pre-train 3D perception models. We introduce a self-supervised pre-training framework that learns effective 3D representations from scratch on unlabeled data, combined with a prompt-adapter-based domain adaptation strategy to reduce dataset bias. The approach significantly improves model performance on downstream tasks such as 3D object detection, BEV segmentation, 3D object tracking, and occupancy prediction, and shows a steady performance increase as the training data volume scales up, demonstrating its potential to continually benefit 3D perception models for autonomous driving. We will release the source code to inspire further investigations in the community.
