Multi-Scale Neighborhood Occupancy Masked Autoencoder for Self-Supervised Learning in LiDAR Point Clouds

Mohamed Abdelsamad, Michael Ulrich, Claudius Gläser, Abhinav Valada

TL;DR

NOMAE tackles the challenge of self-supervised learning on sparse LiDAR point clouds by introducing neighborhood-based occupancy reconstruction across multiple scales, avoiding leakage and enabling efficiency at high voxel resolutions. The method combines a sparse PTv3 encoder, a lightweight multi-scale upsampling module, and localized neighboring decoders with a hierarchical mask generator to supervise occupancy in neighborhoods around visible voxels. It achieves state-of-the-art results on nuScenes and Waymo for semantic segmentation and 3D object detection, supported by extensive ablations showing the benefits of multi-scale supervision and localized reconstruction. The framework offers improved sample efficiency, compatibility with existing 3D architectures, and a practical pathway toward robust, scalable SSL for automotive perception tasks.

Abstract

Masked autoencoders (MAE) have shown tremendous potential for self-supervised learning (SSL) in vision and beyond. However, point clouds from LiDARs used in automated driving are particularly challenging for MAEs since large areas of the 3D volume are empty. Consequently, existing work suffers from leaking occupancy information into the decoder and has significant computational complexity, thereby limiting the SSL pre-training to only 2D bird's eye view encoders in practice. In this work, we propose the novel neighborhood occupancy MAE (NOMAE) that overcomes the aforementioned challenges by employing masked occupancy reconstruction only in the neighborhood of non-masked voxels. We incorporate voxel masking and occupancy reconstruction at multiple scales with our proposed hierarchical mask generation technique to capture features of objects of different sizes in the point cloud. NOMAEs are extremely flexible and can be directly employed for SSL in existing 3D architectures. We perform extensive evaluations on the nuScenes and Waymo Open datasets for the downstream perception tasks of semantic segmentation and 3D object detection, comparing with both discriminative and generative SSL methods. The results demonstrate that NOMAE sets the new state-of-the-art on multiple benchmarks for multiple point cloud perception tasks.
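
The core idea in the abstract, reconstructing occupancy only in the neighborhood of non-masked voxels, can be illustrated with a minimal single-scale NumPy sketch. It is not the authors' implementation. The function name `neighborhood_occupancy_targets`, the cubic window controlled by `radius`, uniform random masking, and the choice to exclude the visible voxels themselves from the query set are all assumptions made for illustration.

```python
import numpy as np

def neighborhood_occupancy_targets(occupied, mask_ratio=0.7, radius=1, seed=0):
    """Build occupancy targets around visible voxels (illustrative sketch).

    occupied: (N, 3) integer array of unique occupied voxel coordinates.
    Returns (visible, query, target): the visible voxels, the neighborhood
    voxels to be predicted, and their 0/1 occupancy labels.
    """
    rng = np.random.default_rng(seed)
    occupied = np.asarray(occupied, dtype=np.int64)
    n = len(occupied)

    # Assumption: uniform random masking; keep at least one voxel visible.
    n_masked = min(int(round(mask_ratio * n)), n - 1)
    perm = rng.permutation(n)
    visible = occupied[perm[n_masked:]]

    # Offsets of a cubic window with side length 2 * radius + 1 (assumed shape).
    r = np.arange(-radius, radius + 1)
    offsets = np.stack(np.meshgrid(r, r, r, indexing="ij"), axis=-1).reshape(-1, 3)

    # Candidate voxels: every visible voxel shifted by every offset, deduplicated.
    candidates = (visible[:, None, :] + offsets[None, :, :]).reshape(-1, 3)
    candidates = np.unique(candidates, axis=0)

    # Assumption: visible voxels are dropped from the queries because the
    # encoder already sees them as occupied.
    visible_set = {tuple(v) for v in visible}
    occupied_set = {tuple(v) for v in occupied}
    keep = np.array([tuple(c) not in visible_set for c in candidates], dtype=bool)
    query = candidates[keep]

    # Label is 1 where a (masked) occupied voxel sits, 0 for empty space.
    target = np.array([tuple(c) in occupied_set for c in query], dtype=np.float32)
    return visible, query, target

# Hypothetical toy scene: two small clusters of occupied voxels.
scene = np.array([[0, 0, 0], [1, 0, 0], [5, 5, 5], [6, 5, 5]])
visible, query, target = neighborhood_occupancy_targets(scene, mask_ratio=0.5)
print(len(visible), "visible voxels,", len(query), "query voxels")
```

Because the queries are bounded by the window around each visible voxel, their number grows with the count of visible voxels rather than with the full 3D volume, which is the efficiency argument the abstract makes against reconstructing the entire grid.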

Paper Structure

This paper contains 28 sections, 5 equations, 9 figures, 17 tables.

Figures (9)

  • Figure 1: NOMAE enables masking and reconstructing occupancy as a self-supervised pretext task for large-scale point clouds. It limits the reconstruction of masked voxels to the neighborhood of visible voxels and reconstructs the masked occupancy at multiple scales. NOMAE achieves state-of-the-art performance on nuScenes semantic segmentation, Waymo semantic segmentation, and nuScenes object detection tasks, outperforming existing self-supervised methods as well as transformer methods.
  • Figure 2: Overview of the proposed NOMAE approach. The input point cloud is first voxelized and masked by the hierarchical mask generator. The encoder $\mathbb{E}$ processes the visible voxels $\mathcal{V}_{\text{v}}$ to yield a hierarchical representation. The upsampler $\mathbb{M}_\text{u}$ then fuses the multi-scale representations to capture high-level features at each scale. For each feature scale, a separate neighboring decoder predicts occupancy in $\mathcal{V}_{\text{n}}$, corresponding to the immediate neighborhood of the visible voxels. The combination of independent learning tasks across multiple feature scales and the localized predictions by the neighboring decoders enables learning representations that are well-suited for 3D point clouds.
  • Figure 3: Illustration of multiscale pretext (MSP) and hierarchical mask generation (HMG).
  • Figure 4: Size (number of voxels) of the reconstructed neighborhood $n$ around visible voxels $\mathcal{V}_\text{v}$, to create $\mathcal{V}_\text{n}$ in the proposed pretext task. We observe that the downstream NonLP semantic segmentation peaks at $n=9$. Note that $n\rightarrow\infty$ corresponds to the Occupancy-MAE method of Min et al. (2022).
  • Figure 5: NonLP performance over the masking ratio $r_\text{t}$ on the nuScenes and Waymo datasets. We observe that the optimal $r_\text{t}(0)$ is $70\%$ for nuScenes and $85\%$ for the Waymo Open Dataset. Our interpretation is that Waymo requires a higher masking ratio because its LiDAR point clouds are denser.
  • ...and 4 more figures