Occupancy-MAE: Self-supervised Pre-training Large-scale LiDAR Point Clouds with Masked Occupancy Autoencoders

Chen Min, Xinli Xu, Dawei Zhao, Liang Xiao, Yiming Nie, Bin Dai

TL;DR

Occupancy-MAE addresses the scarcity of labelled 3D data for autonomous driving by pre-training on large-scale unlabelled outdoor LiDAR point clouds with a masked occupancy autoencoder. It combines a range-aware voxel masking strategy with a binary occupancy prediction objective inside a 3D sparse-convolution encoder–decoder, so that the network learns high-level semantic structure from partial observations. The approach yields consistent improvements across downstream tasks (3D object detection, semantic segmentation, multi-object tracking) and additional gains in unsupervised domain adaptation, demonstrating data-efficient transfer. Overall, Occupancy-MAE offers a practical, scalable pre-training paradigm for voxel-based and pillar-based LiDAR perception with benefits across tasks.

Abstract

Current perception models in autonomous driving rely heavily on large-scale labelled 3D data, which is both costly and time-consuming to annotate. This work proposes to reduce the dependence on labelled 3D training data by pre-training on large-scale unlabelled outdoor LiDAR point clouds using masked autoencoders (MAE). While existing masked point autoencoding methods mainly focus on small-scale indoor point clouds or pillar-based large-scale outdoor LiDAR data, our approach introduces a new self-supervised masked occupancy pre-training method called Occupancy-MAE, specifically designed for voxel-based large-scale outdoor LiDAR point clouds. Occupancy-MAE exploits the fact that the voxel occupancy of outdoor LiDAR point clouds becomes gradually sparser with distance, and combines a range-aware random masking strategy with a pretext task of occupancy prediction. By randomly masking voxels based on their distance to the LiDAR and predicting the masked occupancy structure of the entire surrounding 3D scene, Occupancy-MAE encourages the extraction of high-level semantic information, since the masked voxels must be reconstructed from only a small number of visible voxels. Extensive experiments demonstrate the effectiveness of Occupancy-MAE across several downstream tasks. For 3D object detection, Occupancy-MAE halves the labelled data required for car detection on the KITTI dataset and improves small object detection by approximately 2% in AP on the Waymo dataset. For 3D semantic segmentation, Occupancy-MAE outperforms training from scratch by around 2% in mIoU. For multi-object tracking, Occupancy-MAE improves over training from scratch by approximately 1% in terms of AMOTA and AMOTP. Code is publicly available at https://github.com/chaytonmin/Occupancy-MAE.
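As a rough illustration of the range-aware masking described in the abstract, the following NumPy sketch voxelizes a point cloud and masks a larger fraction of voxels close to the sensor than far away, where the scene is already sparse. The function names, the distance bands (30 m and 50 m), the per-band masking ratios, the voxel size, and the assumption that the LiDAR sits at the origin are all illustrative choices made here, not values taken from this page; the authors' repository should be consulted for the actual implementation.

```python
import numpy as np


def voxelize(points, voxel_size, pc_range):
    """Return the unique integer coordinates (M, 3) of occupied voxels.

    points:     (N, >=3) array, the first three columns are x, y, z.
    voxel_size: (3,) voxel edge lengths in metres (hypothetical values).
    pc_range:   (x_min, y_min, z_min, x_max, y_max, z_max).
    """
    xyz = points[:, :3]
    lo = np.asarray(pc_range[:3], dtype=np.float64)
    hi = np.asarray(pc_range[3:], dtype=np.float64)
    inside = np.all((xyz >= lo) & (xyz < hi), axis=1)
    coords = np.floor((xyz[inside] - lo) / voxel_size).astype(np.int64)
    return np.unique(coords, axis=0)


def range_aware_mask(coords, voxel_size, pc_range, bins, ratios, rng):
    """Return a boolean array where True marks a masked (hidden) voxel.

    bins:   increasing distance thresholds that split voxels into bands.
    ratios: masking ratio per band, len(ratios) == len(bins) + 1.
    Assumes the sensor is located at the origin of the point cloud frame.
    """
    lo = np.asarray(pc_range[:3], dtype=np.float64)
    centers = (coords + 0.5) * voxel_size + lo
    dist = np.linalg.norm(centers, axis=1)
    band = np.digitize(dist, bins)  # band index in 0..len(bins)
    mask = np.zeros(len(coords), dtype=bool)
    for b, ratio in enumerate(ratios):
        idx = np.flatnonzero(band == b)
        n_mask = int(round(ratio * len(idx)))
        if n_mask > 0:
            mask[rng.choice(idx, size=n_mask, replace=False)] = True
    return mask


# Usage with synthetic points (illustrative values only).
rng = np.random.default_rng(0)
points = np.column_stack([
    rng.uniform(-70.0, 70.0, 20000),
    rng.uniform(-70.0, 70.0, 20000),
    rng.uniform(-3.0, 1.0, 20000),
])
voxel_size = np.array([0.5, 0.5, 0.5])
pc_range = (-70.0, -70.0, -3.0, 70.0, 70.0, 1.0)

coords = voxelize(points, voxel_size, pc_range)
mask = range_aware_mask(coords, voxel_size, pc_range,
                        bins=(30.0, 50.0), ratios=(0.9, 0.7, 0.5), rng=rng)
visible = coords[~mask]  # only these voxels would be fed to the encoder
print(len(coords), int(mask.sum()), len(visible))
```

Only the visible voxels are passed to the encoder; the masked ones remain part of the occupancy target that the decoder has to recover.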

Paper Structure

This paper contains 30 sections, 1 equation, 4 figures, 16 tables.

Figures (4)

  • Figure 1: Label-efficiency of our self-supervised pre-training. Occupancy-MAE outperforms training from scratch and achieves the same detection performance with less labelled data (about 50% for the car class and 75% for the pedestrian class).
  • Figure 2: The overall architecture of our Occupancy-MAE. We first transform the large-scale irregular LiDAR point clouds into volumetric representations, randomly mask the voxels according to their distance from the LiDAR sensor (i.e., the range-aware masking strategy), and then reconstruct the geometric occupancy structure of the general 3D world with an asymmetric autoencoder network. We adopt the 3D spatially sparse convolutions of SECOND with positional encoding as the encoding backbone. We apply binary occupancy classification as the pretext task, predicting whether each voxel contains points. After pre-training, the lightweight decoder is discarded, and the encoder is used to warm up the backbones of downstream tasks.
  • Figure 3: (a) Data efficiency of Occupancy-MAE. (b) Comparison of different masking strategies.
  • Figure 4: Qualitative results achieved on the KITTI test set. With the pre-training of our Occupancy-MAE, the 3D detector learns more robust features, reducing missed and false detections.
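The binary occupancy pretext task mentioned in the Figure 2 caption can be sketched in a few lines of NumPy. The target is a dense grid that is 1 wherever a voxel originally contained points (visible or masked) and 0 elsewhere, and the decoder output is scored per voxel as a binary classification. The plain, unweighted binary cross-entropy used below is an assumption made for illustration; this page does not state the exact loss or any class weighting the authors use, and the function names are hypothetical.

```python
import numpy as np


def occupancy_target(coords, grid_shape):
    """Dense 0/1 grid: 1 where a voxel contained at least one point."""
    target = np.zeros(grid_shape, dtype=np.float32)
    target[coords[:, 0], coords[:, 1], coords[:, 2]] = 1.0
    return target


def bce_with_logits(logits, target):
    """Mean binary cross-entropy computed from raw logits.

    Uses the numerically stable form
    max(x, 0) - x * t + log(1 + exp(-|x|)).
    """
    loss = (np.maximum(logits, 0.0) - logits * target
            + np.log1p(np.exp(-np.abs(logits))))
    return float(loss.mean())


# Usage on a tiny 2 x 3 x 4 grid with two occupied voxels.
occupied = np.array([[0, 0, 0], [1, 2, 3]])
target = occupancy_target(occupied, (2, 3, 4))
uninformed = np.zeros((2, 3, 4))  # decoder logits before any training
print(bce_with_logits(uninformed, target))
```

Because most voxels in an outdoor scene are empty, a practical implementation would likely need to handle the class imbalance between occupied and empty voxels; how that is done is not specified here.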