Table of Contents
Fetching ...

Learning Shared RGB-D Fields: Unified Self-supervised Pre-training for Label-efficient LiDAR-Camera 3D Perception

Xiaohao Xu, Ye Li, Tianyi Zhang, Jinrong Yang, Matthew Johnson-Roberson, Xiaonan Huang

TL;DR

A unified pretraining strategy, NeRF-Supervised Masked Auto Encoder (NS-MAE), which optimizes all modalities through a shared formulation and outperforms prior SOTA pre-training methods that employ separate strategies for each modality in BEV map segmentation under the label-efficient fine-tuning setting.

Abstract

Constructing large-scale labeled datasets for multi-modal perception model training in autonomous driving presents significant challenges. This has motivated the development of self-supervised pretraining strategies. However, existing pretraining methods mainly employ distinct approaches for each modality. In contrast, we focus on LiDAR-Camera 3D perception models and introduce a unified pretraining strategy, NeRF-Supervised Masked Auto Encoder (NS-MAE), which optimizes all modalities through a shared formulation. NS-MAE leverages NeRF's ability to encode both appearance and geometry, enabling efficient masked reconstruction of multi-modal data. Specifically, embeddings are extracted from corrupted LiDAR point clouds and images, conditioned on view directions and locations. Then, these embeddings are rendered into multi-modal feature maps from two crucial viewpoints for 3D driving perception: perspective and bird's-eye views. The original uncorrupted data serve as reconstruction targets for self-supervised learning. Extensive experiments demonstrate the superior transferability of NS-MAE across various 3D perception tasks under different fine-tuning settings. Notably, NS-MAE outperforms prior SOTA pre-training methods that employ separate strategies for each modality in BEV map segmentation under the label-efficient fine-tuning setting. Our code is publicly available at https://github.com/Xiaohao-Xu/Unified-Pretrain-AD/ .

Learning Shared RGB-D Fields: Unified Self-supervised Pre-training for Label-efficient LiDAR-Camera 3D Perception

TL;DR

A unified pretraining strategy, NeRF-Supervised Masked Auto Encoder (NS-MAE), which optimizes all modalities through a shared formulation and outperforms prior SOTA pre-training methods that employ separate strategies for each modality in BEV map segmentation under the label-efficient fine-tuning setting.

Abstract

Constructing large-scale labeled datasets for multi-modal perception model training in autonomous driving presents significant challenges. This has motivated the development of self-supervised pretraining strategies. However, existing pretraining methods mainly employ distinct approaches for each modality. In contrast, we focus on LiDAR-Camera 3D perception models and introduce a unified pretraining strategy, NeRF-Supervised Masked Auto Encoder (NS-MAE), which optimizes all modalities through a shared formulation. NS-MAE leverages NeRF's ability to encode both appearance and geometry, enabling efficient masked reconstruction of multi-modal data. Specifically, embeddings are extracted from corrupted LiDAR point clouds and images, conditioned on view directions and locations. Then, these embeddings are rendered into multi-modal feature maps from two crucial viewpoints for 3D driving perception: perspective and bird's-eye views. The original uncorrupted data serve as reconstruction targets for self-supervised learning. Extensive experiments demonstrate the superior transferability of NS-MAE across various 3D perception tasks under different fine-tuning settings. Notably, NS-MAE outperforms prior SOTA pre-training methods that employ separate strategies for each modality in BEV map segmentation under the label-efficient fine-tuning setting. Our code is publicly available at https://github.com/Xiaohao-Xu/Unified-Pretrain-AD/ .
Paper Structure (19 sections, 7 equations, 3 figures, 7 tables)

This paper contains 19 sections, 7 equations, 3 figures, 7 tables.

Figures (3)

  • Figure 1: Motivation. (a) Existing methods predominantly focus on single-modal pre-training voxel-maeboulch2023alsoyang2023gd, or use distinct strategies for each modality in multi-modal models bevdistillli2022simipucalico. (b) Our unified self-supervised pre-training framework consistently applies a single approach across all modalities, improving the transferability of multi-modal representations.
  • Figure 2: Overview of the NS-MAE pre-training framework. Multi-modal representations are learned through self-supervised reconstruction of masked multi-modal inputs, using a differentiable neural rendering process.
  • Figure 3: Qualitative comparisons of BEV map segmentation results on nuScenes nuscenes$val$ with multi-modal perception model BEVFusion liu2022bevfusion.