SiMO: Single-Modality-Operable Multimodal Collaborative Perception

Jiageng Wen, Shengjie Zhao, Bing Li, Jiafeng Huang, Kenan Ye, Hao Deng

TL;DR

Experiments demonstrate that SiMO effectively aligns multimodal features while simultaneously preserving modality-specific features, enabling it to maintain optimal performance across all individual modalities.

Abstract

Collaborative perception integrates multi-agent perspectives to extend the sensing range and overcome occlusion. While existing multimodal approaches leverage complementary sensors to improve performance, they are highly prone to failure, especially when a key sensor such as LiDAR is unavailable. The root cause is that feature fusion leads to semantic mismatches between single-modality features and the downstream modules. This paper addresses this challenge for the first time in the field of collaborative perception, introducing Single-Modality-Operable Multimodal Collaborative Perception (SiMO). By adopting the proposed Length-Adaptive Multi-Modal Fusion (LAMMA), SiMO can adaptively handle the features of the remaining modalities when a modality fails, while keeping the semantic space consistent. Additionally, through the "Pretrain-Align-Fuse-RD" training strategy, SiMO addresses the issue of modality competition, which existing methods generally overlook, ensuring the independence of each individual modality branch. Experiments demonstrate that SiMO effectively aligns multimodal features while simultaneously preserving modality-specific features, enabling it to maintain optimal performance across all individual modalities. Implementation details are available at https://github.com/dempsey-wen/SiMO.
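To make the idea of length-adaptive fusion concrete, the following is a minimal sketch and not the authors' LAMMA implementation. It assumes that each modality contributes a list of token vectors, that a failed modality is passed as None, and that attention uses identity query/key/value projections for brevity; the function name attention_fuse and its dictionary interface are hypothetical. With two modalities the attention runs over the concatenated tokens, and with one modality the same code reduces to self-attention over that modality's tokens.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(modal_tokens):
    """Fuse tokens from whichever modalities are present.

    modal_tokens maps a modality name to a list of token vectors,
    or to None when that modality has failed (hypothetical interface).
    """
    # Keep only modalities that delivered features, so the sequence
    # length adapts to the number of working sensors.
    tokens = [tok for feats in modal_tokens.values() if feats for tok in feats]
    if not tokens:
        raise ValueError("at least one modality must be available")

    dim = len(tokens[0])
    scale = math.sqrt(dim)
    fused = []
    for query in tokens:
        # Scaled dot-product scores of this token against every token.
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in tokens]
        weights = softmax(scores)
        # Weighted sum of the value vectors (values equal the tokens here).
        fused.append(
            [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]
        )
    return fused


lidar = [[1.0, 0.0], [0.0, 1.0]]
camera = [[1.0, 1.0]]

both = attention_fuse({"lidar": lidar, "camera": camera})
camera_only = attention_fuse({"lidar": None, "camera": camera})
print(len(both), len(camera_only))
```

Because the same attention routine handles both cases, the downstream consumer sees features produced by one consistent operation regardless of how many modalities are available, which is the property the abstract attributes to LAMMA.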

Paper Structure

This paper contains 37 sections, 5 equations, 10 figures, 14 tables, 1 algorithm.

Figures (10)

  • Figure 1: a) Existing methods behave like a series circuit and fail when any modality fails, while ours behaves like a parallel circuit that works as long as any branch is functional. b) Common feature fusion methods shift the feature space, so unfused features no longer fit the downstream task heads. c) Aligning features before fusion keeps unfused, fused, and multi-agent features consistent.
  • Figure 2: a) Overview of SiMO. b) When a modality fails, LAMMA adaptively degrades to self-attention fusion so that feature processing stays consistent. c) SiMO performs balanced multimodal learning to preserve modality-specific features and keep the branches independent.
  • Figure 3: The training process of SiMO. (1) Load pretrained feature extractors. (2) Train each aligner with its extractor frozen. (3) Train LAMMA on two-modality input without RD, freezing the aligners and task heads. (4) Fine-tune LAMMA with RD to adapt it to modal failure.
  • Figure 4: Comparison of SiMO-PF, BEVFusion, and Pyramid Fusion (L) under varying degrees of LiDAR failure.
  • Figure 5: Visualization of detection results from SiMO-PF and HEAL (Pyramid Fusion), together with ground truth and collaborative point clouds. (a), (b), and (c) show detections with all, 15000, and 5000 LiDAR points (obtained by occluding local point clouds); (d) shows detections with camera images only. SiMO-PF detects most objects even with 5000 LiDAR points, while HEAL misses most of them.
  • ...and 5 more figures
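
The four training stages named in the Figure 3 caption can be written down as a small schedule that records which modules are updated at each stage. This is a sketch under stated assumptions, not the released training code: the module names, the Stage class, and the frozen_modules helper are hypothetical. The caption only says that extractors are frozen in stage 2 and that aligners and task heads are frozen in stage 3; keeping extractors frozen afterwards, leaving task heads untouched in stage 2, and reusing the stage 3 freezing pattern in stage 4 are assumptions made here.

```python
from dataclasses import dataclass

# Hypothetical module names for the components mentioned in the caption.
MODULES = ("extractors", "aligners", "lamma", "task_heads")


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset
    use_rd: bool = False  # whether RD is enabled during this stage


SCHEDULE = (
    # (1) Load pretrained feature extractors; nothing is trained here.
    Stage("pretrain", frozenset()),
    # (2) Train each aligner with its extractor frozen.
    #     Assumption: task heads are not updated in this sketch.
    Stage("align", frozenset({"aligners"})),
    # (3) Train LAMMA on two-modality input, aligners and task heads frozen.
    #     Assumption: extractors stay frozen as well.
    Stage("fuse", frozenset({"lamma"})),
    # (4) Fine-tune LAMMA with RD. Assumption: same frozen set as stage 3.
    Stage("rd", frozenset({"lamma"}), use_rd=True),
)


def frozen_modules(stage):
    """Return the modules that receive no gradient updates in a stage."""
    return sorted(set(MODULES) - stage.trainable)


for stage in SCHEDULE:
    print(stage.name, "frozen:", frozen_modules(stage), "RD:", stage.use_rd)
```

In a real training loop, each stage would translate into toggling gradient updates on the listed modules before running that stage's epochs.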