Resilient Sensor Fusion under Adverse Sensor Failures via Multi-Modal Expert Fusion

Konyul Park, Yecheol Kim, Daehun Kim, Jun Won Choi

TL;DR

MoME addresses robustness gaps in LiDAR-camera fusion for 3D object detection under adverse sensor conditions by decoupling modalities with a Mixture of Experts framework. It employs three parallel decoders (LiDAR, Camera, LiDAR-Camera) and an Adaptive Query Router to assign each object query to the most suitable expert via a locality-aware attention mechanism, while keeping decoding cost near that of a single decoder. Training with synthetic sensor dropouts and a three-stage regime enables robust per-query routing, achieving state-of-the-art performance on nuScenes-R and nuScenes-C benchmarks, including significant gains under LiDAR-drop, camera-drop, and limited FOV scenarios. The approach offers practical impact for autonomous driving by providing robust perception without extensive computational overhead, and it provides a blueprint for efficient, failure-aware multi-modal fusion in real-world deployments.

Abstract

Modern autonomous driving perception systems utilize complementary multi-modal sensors, such as LiDAR and cameras. Although sensor fusion architectures enhance performance in challenging environments, they still suffer significant performance drops under severe sensor failures, such as LiDAR beam reduction, LiDAR drop, limited field of view, camera drop, and occlusion. This limitation stems from inter-modality dependencies in current sensor fusion frameworks. In this study, we introduce an efficient and robust LiDAR-camera 3D object detector, referred to as MoME, which achieves robust performance through a mixture of experts approach. MoME fully decouples modality dependencies using three parallel expert decoders, which decode object queries from camera features, LiDAR features, or a combination of both, respectively. We propose a Multi-Expert Decoding (MED) framework, in which each query is decoded selectively by one of the three expert decoders. MoME utilizes an Adaptive Query Router (AQR) to select the most appropriate expert decoder for each query based on the quality of the camera and LiDAR features. This ensures that each query is processed by the best-suited expert, resulting in robust performance across diverse sensor failure scenarios. We evaluated MoME on the nuScenes-R benchmark, where it achieved state-of-the-art performance in extreme weather and sensor failure conditions, significantly outperforming existing models across various sensor failure scenarios.
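
The per-query routing described in the abstract can be illustrated with a small sketch. This is not the authors' implementation: the router scores, the top-1 selection rule, and the names `route_queries`, `multi_expert_decode`, and `EXPERTS` are assumptions introduced here for illustration, and the decoders are toy stand-ins. The sketch only shows the idea that each query is sent to exactly one of the three experts, so every query is decoded once.

```python
from typing import Callable, Dict, List, Sequence

# Hypothetical expert names, mirroring the three decoders named in the abstract.
EXPERTS = ("lidar", "camera", "lidar_camera")


def route_queries(router_logits: Sequence[Sequence[float]]) -> List[int]:
    """Return one expert index per query (top-1 over that query's scores).

    Assumption: the router emits one score per expert for every query.
    Ties resolve to the lowest expert index.
    """
    return [max(range(len(row)), key=row.__getitem__) for row in router_logits]


def multi_expert_decode(
    queries: Sequence[object],
    router_logits: Sequence[Sequence[float]],
    decoders: Dict[str, Callable[[object], object]],
) -> List[object]:
    """Decode each query with exactly one expert decoder."""
    assignment = route_queries(router_logits)
    return [decoders[EXPERTS[e]](q) for q, e in zip(queries, assignment)]


if __name__ == "__main__":
    # Toy decoders that just tag the query with the expert that handled it.
    toy_decoders = {name: (lambda q, n=name: (n, q)) for name in EXPERTS}
    # Made-up router scores for three queries.
    scores = [[2.0, 0.1, 0.5], [0.0, 3.0, 1.0], [0.2, 0.3, 0.9]]
    print(multi_expert_decode(["q0", "q1", "q2"], scores, toy_decoders))
```

Because each query passes through a single expert rather than all three, the total decoding work in this sketch scales with the number of queries, not with the number of experts.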

Paper Structure

This paper contains 22 sections, 4 equations, 7 figures, 10 tables, 2 algorithms.

Figures (7)

  • Figure 1: Comparisons of several LiDAR-camera fusion methods. (a) Single decoder, (b) Separate decoder, and (c) Proposed MoME. MoME utilizes three parallel decoders, each processing camera features, LiDAR features, or a combination of both, to decouple dependencies between the modalities. It adaptively routes each query to the most suitable decoder based on the quality of the camera and LiDAR features.
  • Figure 2: Comparison of predictions and performance of different fusion methods. Left: Examples of detection results for a single object. For visualization, red boxes indicate predictions from the LiDAR-camera decoder, blue boxes from the LiDAR decoder, and orange boxes from the camera decoder, with ground truth (GT) shown in black. Right: NDS performance across various scenarios—Clean, LiDAR drop, Limited FOV, and Camera drop. (a) Under sensor failure, predictions are prone to error due to the corrupted modality. (b) Due to uncalibrated confidence score scales, lower-quality predictions may have higher confidence scores. (c) The proposed method selects the optimal decoder for decoding each object query.
  • Figure 3: Overall structure of MoME. MoME utilizes three expert decoders, each specialized for processing LiDAR, camera, or LiDAR-camera features. The AQR dynamically assigns each object query to the most suitable expert decoder in an adaptive manner. The operation of AQR relies on local features filtered through the Local Attention Mask.
  • Figure 4: Qualitative results under various sensor failure scenarios. Comparison of detection results under four challenging scenarios: LiDAR drop, Camera drop, Limited FOV, and Occlusion. MoME exhibits higher robustness and consistent results compared to CMT.
  • Figure 5: Qualitative results under various sensor failure scenarios. Comparison of detection results between MoME and CMT under six sensor failure scenarios: Beam Reduction, LiDAR Drop, Limited FOV, Object Failure, View Drop, and Occlusion. The results demonstrate MoME's detection capabilities across these challenging conditions.
  • ...and 2 more figures
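
The Figure 3 caption notes that the AQR operates on local features filtered through a Local Attention Mask. The excerpt does not define that mask, so the following is only a sketch of one plausible reading: each query may attend only to feature locations within a fixed distance of its reference point. The circular window, the 2D coordinates, the `radius` parameter, and the function name `local_attention_mask` are all assumptions made here.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def local_attention_mask(
    query_xy: Sequence[Point],
    feature_xy: Sequence[Point],
    radius: float,
) -> List[List[bool]]:
    """Build a boolean mask of shape [num_queries][num_features].

    mask[i][j] is True when feature j lies within `radius` of query i,
    meaning query i is allowed to attend to feature j. The circular
    window is an assumption, not the paper's stated definition.
    """
    r2 = radius * radius
    return [
        [(qx - fx) ** 2 + (qy - fy) ** 2 <= r2 for fx, fy in feature_xy]
        for qx, qy in query_xy
    ]


if __name__ == "__main__":
    queries = [(0.0, 0.0), (10.0, 0.0)]
    features = [(1.0, 0.0), (9.0, 0.0), (5.0, 0.0)]
    for row in local_attention_mask(queries, features, radius=2.0):
        print(row)
```

In an attention layer, such a mask would typically be used to exclude the `False` positions before the softmax, so that a query's routing decision depends only on nearby features.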