Multi-Modal Decouple and Recouple Network for Robust 3D Object Detection

Rui Ding, Zhaonian Kuang, Yuzhe Ji, Meng Yang, Xinhu Zheng, Gang Hua

TL;DR

A Multi-Modal Decouple and Recouple Network for robust 3D object detection under data corruption that allows modality-invariant features to compensate for each other while mitigating the negative impact of a corrupted modality on the other.

Abstract

Multi-modal 3D object detection with bird's eye view (BEV) has made substantial advances on benchmarks. Nonetheless, accuracy may drop significantly in the real world due to data corruption arising from, for example, sensor configurations for LiDAR and scene conditions for cameras. One design bottleneck of previous models lies in the tight coupling of multi-modal BEV features during fusion, which may degrade overall system performance if one or both modalities are corrupted. To mitigate this, we propose a Multi-Modal Decouple and Recouple Network for robust 3D object detection under data corruption. Different modalities commonly share some high-level invariant features. We observe that these invariant features do not always fail simultaneously across modalities, because different types of data corruption affect each modality in distinct ways. These invariant features can therefore be recovered across modalities for robust fusion under data corruption. To this end, we explicitly decouple camera/LiDAR BEV features into modality-invariant and modality-specific parts. This allows invariant features to compensate for each other while mitigating the negative impact of a corrupted modality on the other. We then recouple these features into three experts that handle corruption of LiDAR, of the camera, and of both, respectively. For each expert, we use modality-invariant features as robust information, while modality-specific features serve as a complement. Finally, we adaptively fuse the three experts to extract robust features for 3D object detection. For validation, we collect a benchmark based on nuScenes with a large quantity of data corruption for LiDAR, camera, and both. Our model is trained on clean nuScenes and tested on all types of data corruption. It consistently achieves the best accuracy on both corrupted and clean data compared to recent models.
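The decouple, recouple, and adaptive-fusion steps described in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation: the linear projections in `decouple`, the mixing weight `alpha` in `recouple`, the pairing of features per expert, and the softmax gate in `fuse_experts` are all assumptions made for illustration, and the random `gate_logits` stand in for whatever learned corruption-aware weighting the paper uses.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    e = np.exp(shifted)
    return e / e.sum(axis=axis, keepdims=True)

def decouple(feat, w_invariant, w_specific):
    """Assumed decoupling: two linear projections of flattened BEV features (N, C)."""
    return feat @ w_invariant, feat @ w_specific

def recouple(invariant, specific, alpha=0.5):
    """Assumed recoupling: invariant features as the base, specific features as a
    down-weighted complement (alpha is a hypothetical mixing weight)."""
    return invariant + alpha * specific

def fuse_experts(experts, gate_logits):
    """Assumed adaptive fusion: per-cell softmax weights over the experts.
    experts: (E, N, C), gate_logits: (N, E) -> fused (N, C), weights (N, E)."""
    weights = softmax(gate_logits, axis=-1)
    fused = np.einsum("ne,enc->nc", weights, experts)
    return fused, weights

rng = np.random.default_rng(0)
n_cells, channels = 6, 8  # hypothetical sizes
cam_bev = rng.normal(size=(n_cells, channels))
lidar_bev = rng.normal(size=(n_cells, channels))

w_inv = rng.normal(size=(channels, channels))    # shared invariant projection
w_cam = rng.normal(size=(channels, channels))    # camera-specific projection
w_lidar = rng.normal(size=(channels, channels))  # LiDAR-specific projection

cam_inv, cam_spec = decouple(cam_bev, w_inv, w_cam)
lidar_inv, lidar_spec = decouple(lidar_bev, w_inv, w_lidar)
shared_inv = 0.5 * (cam_inv + lidar_inv)

# Three experts; which features feed which expert is an assumption here.
experts = np.stack([
    recouple(shared_inv, lidar_spec),                     # LiDAR expert
    recouple(shared_inv, cam_spec),                       # camera expert
    recouple(shared_inv, 0.5 * (cam_spec + lidar_spec)),  # fusion expert
])

gate_logits = rng.normal(size=(n_cells, 3))  # stand-in for a learned gate
fused, weights = fuse_experts(experts, gate_logits)
print(fused.shape, weights.sum(axis=-1))
```

With a gate of this form, pushing one expert's logit far above the others makes the fused output collapse to that expert, which is the behavior one would want when a single modality is heavily corrupted.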

Paper Structure (42 sections, 10 equations, 8 figures, 15 tables)

Figures (8)

  • Figure 1: Effect of data corruption on decoupled camera/LiDAR features. (a)-(c) We measure the similarity between clean and corrupted features (using the metric in [CKA1, CKA2, CKA3]) for the modality-invariant and modality-specific parts, respectively. The modality-specific features of LiDAR and camera are severely affected by different types of data corruption, while the modality-invariant features are only slightly affected. Moreover, modality-invariant features consistently exhibit higher stability across corruption severities, so they can be used for robust fusion. (d) Compared to BEVFusion [liu2023bevfusion], in which LiDAR/camera BEV features are tightly coupled, our method achieves significant gains on both corrupted and clean data.
  • Figure 2: Overview of our framework. First, the LiDAR and camera encoders extract BEV features from point clouds and multi-view images, respectively. Second, the Modality Decouple Module decouples LiDAR/camera BEV features into modality-invariant and modality-specific parts. Then, the Modality Recouple Module recouples the invariant and specific parts into three experts for LiDAR, camera, and fusion, respectively. Finally, the three experts are adaptively fused, based on the corruption level of each modality, to generate robust features. The detection heads take the fused features for robust 3D object detection. Our model is trained on clean data and tested on various types of data corruption at inference to ensure its robustness in the real world.
  • Figure 3: Modality Decouple Module. Three encoders decouple features into modality-invariant and modality-specific parts. The specific features capture information unique to each modality. The modality-invariant encoder extracts invariant features across modalities. $L_{Sim}$ enforces consistency between the invariant features of camera and LiDAR, while $L_{Diff}$ separates them from the modality-specific ones. An auxiliary invariant head, used only during training, ensures the invariant features are truly useful for detection.
  • Figure 4: Cross-modal recouple in the Modality Recouple Module. The cross-modal recouple consists of one self-attention and two cross-attention operations. In cross-attention, the sampling offsets and weights are learnable, so cross-attention can dynamically extract useful information from corrupted and invariant features as supplements.
  • Figure 5: Examples of data corruption. We simulate various types of real-world data corruption for cameras, LiDAR, and both, and add them to clean nuScenes validation images to generate corrupted data for testing.
  • ...and 3 more figures
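
The clean-versus-corrupted feature similarity reported in Figure 1 can be illustrated with a short sketch. It assumes the cited metric is centered kernel alignment (CKA) and uses the linear variant; the caption only gives citation keys, so the exact variant used in the paper is not confirmed here. The feature matrices and noise levels below are synthetic stand-ins, not data from the paper.

```python
import numpy as np

def linear_cka(x, y):
    """Linear CKA between feature matrices x (n, d1) and y (n, d2).

    Rows are samples (e.g. flattened BEV cells), columns are channels.
    """
    # Center each channel so the score ignores constant offsets.
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return cross / (norm_x * norm_y)

rng = np.random.default_rng(0)
clean = rng.normal(size=(256, 16))                    # synthetic "clean" features
mild = clean + 0.1 * rng.normal(size=clean.shape)     # hypothetical mild corruption
severe = clean + 2.0 * rng.normal(size=clean.shape)   # hypothetical severe corruption

cka_mild = linear_cka(clean, mild)
cka_severe = linear_cka(clean, severe)
print(f"mild: {cka_mild:.3f}, severe: {cka_severe:.3f}")
```

A score near 1 means the corrupted features preserve the structure of the clean ones, which is the property the caption attributes to the modality-invariant part; a lower score corresponds to features that are more strongly disturbed by the corruption.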