UniM$^2$AE: Multi-modal Masked Autoencoders with Unified 3D Representation for 3D Perception in Autonomous Driving

Jian Zou, Tianyu Huang, Guanglei Yang, Zhenhua Guo, Tao Luo, Chun-Mei Feng, Wangmeng Zuo

TL;DR

UniM$^2$AE tackles the challenge of multi-modal self-supervised learning for autonomous driving by fusing camera and LiDAR data in a unified 3D volume space that includes height information. It introduces the Multi-modal 3D Interaction Module (MMIM) to enable effective cross-modal interaction within this unified representation, followed by modality-specific decoding to reconstruct each input without sacrificing geometric or semantic fidelity. The approach yields consistent improvements on downstream tasks such as 3D object detection and BEV map segmentation on the nuScenes dataset, including data-efficient gains when limited labeled data is available. This work advances practical multi-modal pre-training by reducing information loss during fusion and enhancing cross-modal feature learning for real-world driving scenarios.

Abstract

Masked Autoencoders (MAE) play a pivotal role in learning potent representations, delivering outstanding results across various 3D perception tasks essential for autonomous driving. In real-world driving scenarios, it is commonplace to deploy multiple sensors for comprehensive environment perception. Although integrating multi-modal features from these sensors can produce rich and powerful features, MAE methods struggle to address this integration because of the substantial disparity between the modalities. This research delves into multi-modal Masked Autoencoders tailored for a unified representation space in autonomous driving, aiming at a more efficient fusion of two distinct modalities. To combine the semantics inherent in images with the geometric detail of LiDAR point clouds, we propose UniM$^2$AE, a potent yet straightforward multi-modal self-supervised pre-training framework that mainly consists of two designs. First, it projects the features from both modalities into a cohesive 3D volume space that extends the bird's eye view (BEV) with the height dimension. This extension allows for a precise representation of objects and reduces information loss when aligning multi-modal features. Second, the Multi-modal 3D Interactive Module (MMIM) is invoked to facilitate efficient inter-modal interaction. Extensive experiments on the nuScenes dataset attest to the efficacy of UniM$^2$AE, indicating enhancements in 3D object detection and BEV map segmentation of 1.2\% NDS and 6.5\% mIoU, respectively. The code is available at https://github.com/hollow-503/UniM2AE.

Paper Structure (30 sections, 6 equations, 3 figures, 5 tables)

Figures (3)

  • Figure 1: (a) Multi-modal frameworks (chen2023pimae) that align masked inputs before feature extraction but ignore the feature characteristics of the two branches. (b) UniM$^2$AE, which lets multi-modal features interact within a unified representation.
  • Figure 2: Overview of UniM$^2$AE. The LiDAR branch voxelizes the point cloud, while the camera branch divides multiple images into patches; both then randomly mask their inputs. The tokens from the two branches are individually embedded and then passed through the Token-Volume projection, the Multi-modal 3D Interaction Module, the Volume-Token projection, and finally the modality-specific decoders. The original inputs are then reconstructed from the fused features.
  • Figure 3: Illustration of our Multi-modal 3D Interaction Module. We first concatenate the inputs $\left(F_V^{vol}, F_I^{vol}\right)$ and reshape them for the subsequent stacked 3D deformable self-attention blocks. After interaction, we split the output and project the parts back to feature tokens. This contributes to more generalized and effective feature learning.
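
The random masking step mentioned in the Figure 2 caption can be sketched as follows. This is only an illustration of MAE-style random token masking, not the authors' implementation: the function name `random_mask`, the token counts, and the mask ratios are all hypothetical, since the text above does not state the values used in the paper.

```python
import random


def random_mask(num_tokens, mask_ratio, seed=None):
    """Randomly split token indices into visible and masked sets.

    Hypothetical helper for illustration: it returns two sorted index
    lists, (visible, masked), that together cover range(num_tokens).
    """
    if num_tokens < 0:
        raise ValueError("num_tokens must be non-negative")
    if not 0.0 <= mask_ratio < 1.0:
        raise ValueError("mask_ratio must be in [0, 1)")

    rng = random.Random(seed)       # local RNG so the global state is untouched
    order = list(range(num_tokens))
    rng.shuffle(order)              # random permutation of token indices

    num_masked = round(num_tokens * mask_ratio)
    masked = sorted(order[:num_masked])
    visible = sorted(order[num_masked:])
    return visible, masked


# Assumed values: each branch masks its own tokens independently.
voxel_visible, voxel_masked = random_mask(num_tokens=1000, mask_ratio=0.5, seed=0)
patch_visible, patch_masked = random_mask(num_tokens=6 * 196, mask_ratio=0.75, seed=1)
```

In a pipeline like the one in Figure 2, only the visible tokens would be embedded and passed to the encoder, while the masked indices mark the positions each modality-specific decoder has to reconstruct.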