ModalPatch: A Plug-and-Play Module for Robust Multi-Modal 3D Object Detection under Modality Drop

Shuangzhi Li, Lei Ma, Xingyu Li

TL;DR

ModalPatch is the first plug-and-play module designed to enable robust detection under arbitrary modality-drop scenarios. It combines history-based feature prediction with an uncertainty-guided cross-modality fusion strategy that dynamically estimates the reliability of compensated features, suppressing biased signals while reinforcing informative ones.

Abstract

Multi-modal 3D object detection is pivotal for autonomous driving, integrating complementary sensors like LiDAR and cameras. However, its real-world reliability is challenged by transient data interruptions and missing data, where modalities can momentarily drop due to hardware glitches, adverse weather, or occlusions. This poses a critical risk, especially during a simultaneous modality drop, where the vehicle is momentarily blind. To address this problem, we introduce ModalPatch, the first plug-and-play module designed to enable robust detection under arbitrary modality-drop scenarios. Without requiring architectural changes or retraining, ModalPatch can be seamlessly integrated into diverse detection frameworks. Technically, ModalPatch leverages the temporal nature of sensor data for perceptual continuity, using a history-based module to predict and compensate for transiently unavailable features. To improve the fidelity of the predicted features, we further introduce an uncertainty-guided cross-modality fusion strategy that dynamically estimates the reliability of compensated features, suppressing biased signals while reinforcing informative ones. Extensive experiments show that ModalPatch consistently enhances both robustness and accuracy of state-of-the-art 3D object detectors under diverse modality-drop conditions.
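
To make the idea of history-based compensation concrete, the following is a minimal sketch, not the authors' implementation. It assumes a toy feature (a flat list of floats), a hypothetical `HistoryCompensator` class, and linear extrapolation from the two most recent frames as a stand-in for the learned history-based prediction module described in the abstract.

```python
from collections import deque
from typing import Deque, List, Optional

Feature = List[float]  # toy stand-in for a flattened feature map (assumption)


class HistoryCompensator:
    """Keeps recent features for one modality and fills in dropped frames."""

    def __init__(self, max_history: int = 4) -> None:
        # Bounded memory bank: the oldest frame is discarded automatically.
        self.memory: Deque[Feature] = deque(maxlen=max_history)

    def predict(self) -> Optional[Feature]:
        # Stand-in predictor (assumption): linear extrapolation from the last
        # two stored frames. The paper uses a learned temporal module instead.
        if not self.memory:
            return None  # no history yet, so nothing to compensate from
        if len(self.memory) == 1:
            return list(self.memory[-1])
        prev, last = self.memory[-2], self.memory[-1]
        return [2.0 * b - a for a, b in zip(prev, last)]

    def step(self, feature: Optional[Feature]) -> Optional[Feature]:
        """Return the observed feature, or a predicted one if it was dropped."""
        out = feature if feature is not None else self.predict()
        if out is not None:
            self.memory.append(list(out))  # predictions also enter the history
        return out


lidar = HistoryCompensator()
lidar.step([1.0, 2.0])
lidar.step([2.0, 4.0])
filled = lidar.step(None)  # LiDAR dropped on this frame
print(filled)  # [3.0, 6.0]
```

Keeping one compensator per modality lets the same logic cover a LiDAR drop, a camera drop, or both at once, since each stream falls back on its own history. Storing predictions in the memory bank is a design choice of this sketch; it lets consecutive drops be bridged, at the cost of accumulating prediction error.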

Paper Structure (14 sections, 10 equations, 6 figures, 5 tables)

Figures (6)

  • Figure 1: (a) The proposed ModalPatch framework compensates for arbitrary modality drops (LiDAR or camera) and can be seamlessly integrated into existing detectors without retraining them. (b) Performance boost achieved by ModalPatch at a 30% modality-drop rate for various detectors.
  • Figure 2: Overview of the proposed ModalPatch module. Given multi-modal inputs (e.g., point clouds from LiDAR and images from cameras), features are first extracted by the frozen backbone. To address possible modality-drop scenarios, the plug-and-play ModalPatch introduces two modules: (1) a history-based feature prediction module, which leverages temporal information from past frames to predict current features and provide initial compensation; and (2) an uncertainty-based cross-modality fusion module, which estimates spatial uncertainty and employs cross-modality complementary information to enhance compensated features.
  • Figure 3: History-based temporal transformer, taking learnable queries and the history memory bank as inputs to generate compensated features.
  • Figure 4: Uncertainty-guided cross-modality transformer, taking compensated features and uncertainty maps of two modalities as inputs to cross-modally enhance features.
  • Figure 5: Qualitative visualizations of the UniBEV and CMT detectors with and without ModalPatch under a 50% drop rate. Each pair compares the baseline detector (Base) with the detector enhanced by ModalPatch (+ModalPatch), where red boxes denote ground-truth objects and blue boxes denote detected objects.
  • ...and 1 more figure
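
The uncertainty-guided fusion described in the captions of Figures 2 and 4 can be illustrated with a much simpler stand-in. The sketch below is an assumption-laden toy, not the paper's cross-modality transformer: the function name `fuse_by_uncertainty`, the flat per-cell feature lists, and the `exp(-u)` reliability weighting are all hypothetical choices made for illustration.

```python
import math
from typing import List


def fuse_by_uncertainty(
    lidar_feat: List[float],
    camera_feat: List[float],
    lidar_unc: List[float],
    camera_unc: List[float],
) -> List[float]:
    """Per-cell convex combination; higher uncertainty means lower weight."""
    fused = []
    for fl, fc, ul, uc in zip(lidar_feat, camera_feat, lidar_unc, camera_unc):
        # Assumed reliability weights: exp(-u) is positive and decreases
        # monotonically as the estimated uncertainty u grows.
        wl, wc = math.exp(-ul), math.exp(-uc)
        fused.append((wl * fl + wc * fc) / (wl + wc))
    return fused


# Equal uncertainty: the fused value is the plain average of both modalities.
balanced = fuse_by_uncertainty([1.0], [3.0], [0.0], [0.0])
print(balanced)  # [2.0]

# Highly uncertain camera feature (e.g., a compensated one): the output
# stays close to the LiDAR value.
skewed = fuse_by_uncertainty([1.0], [3.0], [0.0], [10.0])
```

The weights are normalized per cell, so the fused value always lies between the two inputs. This captures the stated goal of suppressing unreliable compensated signals while reinforcing informative ones, though a learned transformer can model far richer interactions than a scalar weighting.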