
Robust Multimodal 3D Object Detection via Modality-Agnostic Decoding and Proximity-based Modality Ensemble

Juhan Cha, Minseok Joo, Jihwan Park, Sanghyeok Lee, Injae Kim, Hyunwoo J. Kim

TL;DR

This work tackles robust multimodal 3D object detection by addressing two problems: over-reliance on LiDAR and negative fusion between LiDAR and camera data. It introduces MEFormer, which consists of Modality-Agnostic Decoding (MOAD) and a Proximity-based Modality Ensemble (PME). Together they let a shared transformer decoder extract geometric and semantic information from each modality and adaptively fuse the resulting predictions. MOAD trains a single decoder to produce both modality-specific and joint representations, while PME adds a center-distance bias to cross-attention to limit noise transfer between predictions and to favor the more reliable modality for a given environment. On nuScenes, MEFormer achieves state-of-the-art results (NDS up to 74.3% and mAP up to 72.2%) and remains robust when a sensor is missing and under adverse environmental conditions, which matters for reliable autonomous driving systems.

Abstract

Recent advancements in 3D object detection have benefited from multi-modal information from multi-view cameras and LiDAR sensors. However, the inherent disparities between the modalities pose substantial challenges. We observe that existing multi-modal 3D object detection methods heavily rely on the LiDAR sensor, treating the camera as an auxiliary modality for augmenting semantic details. This often leads not only to underutilization of camera data but also to significant performance degradation in scenarios where LiDAR data is unavailable. Additionally, existing fusion methods overlook the detrimental impact on detection performance of sensor noise induced by environmental changes. In this paper, we propose MEFormer to address the LiDAR over-reliance problem by harnessing critical information for 3D object detection from every available modality while concurrently safeguarding against corrupted signals during the fusion process. Specifically, we introduce Modality Agnostic Decoding (MOAD), which extracts geometric and semantic features with a shared transformer decoder regardless of input modalities and provides promising improvement with a single modality as well as with multiple modalities. Additionally, our Proximity-based Modality Ensemble (PME) module adaptively utilizes the strengths of each modality depending on the environment while mitigating the effects of a noisy sensor. Our MEFormer achieves state-of-the-art performance of 73.9% NDS and 71.5% mAP on the nuScenes validation set. Extensive analyses validate that our MEFormer improves robustness against challenging conditions such as sensor malfunctions or environmental changes. The source code is available at https://github.com/hanchaa/MEFormer

Paper Structure (19 sections, 9 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: Left: Comparison of the performance drop in missing-sensor scenarios. MEFormer shows the smallest performance degradation compared to previous works. Specifically, in the camera-only scenario, CMT [yan2023cmt] shows a 32% mAP drop and BEVFusion [liu2023bevfusion] shows a 68.3% mAP drop, while ours shows only a 29% mAP drop. Right: Illustration of negative fusion. Although the prediction from a single modality (e.g., camera) is correct, the multimodal predictions are often negatively affected by inaccurate unimodal signals, resulting in misclassification.
  • Figure 2: The overall architecture of MEFormer. In our framework, we employ two modalities: images (camera) and point clouds (LiDAR). The camera and LiDAR backbones extract feature maps from the images and the point clouds, respectively. Then, three modality decoding branches process the initial object query $Q$. Each branch uses the transformer decoder $f$, whose parameters are shared across all branches. Each branch uses a different combination of modalities as key and value, i.e., LiDAR + camera (LC), LiDAR (L), and camera (C), resulting in the box features $Z_{LC}$, $Z_{L}$, and $Z_{C}$, respectively. During training, the predicted boxes of each modality decoding branch are separately supervised by the ground-truth boxes. Finally, the cross-attention-based PME module takes $Z_{LC}$ as the query and $Z_{LC}$, $Z_L$, and $Z_C$ as the key and value, generating the final ensembled box features $Z_e$. The predicted boxes derived from $Z_e$ are also supervised by the ground-truth boxes.
  • Figure 3: Illustration of our proximity-based modality ensemble module. PME takes the box features $Z_{LC}$ as the query and $Z_{LC}$, $Z_L$, and $Z_C$ as the key and value. To reduce the interaction between irrelevant boxes, we calculate the attention bias $M$ based on the center distance between the predicted boxes. Then, we add the attention bias $M$ to the attention logits before applying the softmax function.
  • Figure 4: Qualitative results of MOAD on the nuScenes validation set. With MOAD, the model detects a truck with the help of geometric information in the camera modality, whereas without MOAD it fails.
  • Figure 5: Qualitative results on multi-view images and in BEV space at night time on the nuScenes validation set. MEFormer shows promising detection results for objects that are difficult to identify with the cameras. We overlay additional ground-truth boxes on those objects so that they are easy to recognize in the images.
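
The data flow described in the captions of Figures 2 and 3 can be illustrated with a small NumPy sketch. This is a toy illustration under stated assumptions, not the authors' implementation: the shared decoder $f$ is replaced by a single cross-attention layer with one shared projection `W`, the predicted box centers are random placeholders, and the attention bias is assumed to be the negative scaled Euclidean center distance, since the captions do not give the exact form of $M$. All function and variable names here are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(query, key, value, bias=None):
    """Single-head scaled dot-product cross-attention with an optional additive bias."""
    logits = query @ key.T / np.sqrt(query.shape[-1])
    if bias is not None:
        logits = logits + bias  # bias is added to the logits before the softmax
    weights = softmax(logits, axis=-1)
    return weights @ value, weights

def shared_decoder(Q, tokens, W):
    """Toy stand-in for the shared decoder f: one cross-attention layer plus a residual.
    The same projection W is reused by every modality branch."""
    out, _ = cross_attention(Q @ W, tokens @ W, tokens)
    return Q + out

def proximity_bias(centers_q, centers_kv, scale=1.0):
    """ASSUMED form of the bias: M[i, j] = -scale * ||c_i - c_j||_2."""
    diff = centers_q[:, None, :] - centers_kv[None, :, :]
    return -scale * np.linalg.norm(diff, axis=-1)

rng = np.random.default_rng(0)
n_q, d = 4, 8
Q = rng.normal(size=(n_q, d))                 # initial object queries
W = rng.normal(size=(d, d)) / np.sqrt(d)      # parameters shared by all branches
lidar_tokens = rng.normal(size=(10, d))       # placeholder LiDAR features
camera_tokens = rng.normal(size=(12, d))      # placeholder camera features

# Three decoding branches, same decoder, different key/value sets (Figure 2).
Z_LC = shared_decoder(Q, np.concatenate([lidar_tokens, camera_tokens]), W)
Z_L = shared_decoder(Q, lidar_tokens, W)
Z_C = shared_decoder(Q, camera_tokens, W)

# Placeholder predicted box centers (BEV x, y) for each branch.
centers_LC = rng.uniform(-50, 50, size=(n_q, 2))
centers_L = centers_LC + rng.normal(scale=0.5, size=(n_q, 2))
centers_C = centers_LC + rng.normal(scale=2.0, size=(n_q, 2))

# Ensemble step (Figure 3): Z_LC is the query; Z_LC, Z_L, Z_C are key and value.
kv = np.concatenate([Z_LC, Z_L, Z_C])
centers_kv = np.concatenate([centers_LC, centers_L, centers_C])
M = proximity_bias(centers_LC, centers_kv)
Z_e, attn = cross_attention(Z_LC, kv, kv, bias=M)
```

Because the assumed bias grows more negative with distance, each query box attends mostly to boxes predicted near its own center, which mirrors the stated goal of reducing interaction between irrelevant boxes. In a trained model the centers would come from the box predictions of each branch rather than from random draws.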