
EIMC: Efficient Instance-aware Multi-modal Collaborative Perception

Kang Yang, Peng Wang, Lantao Li, Tianci Bu, Chen Sun, Deying Li, Yongcai Wang

TL;DR

EIMC proposes an early collaborative paradigm: lightweight collaborative voxels, transmitted by neighbor agents, are injected into the ego's local modality-fusion step, yielding compact yet informative 3D collaborative priors that tighten cross-modal alignment.

Abstract

Multi-modal collaborative perception deserves close attention as a means of enhancing the safety of autonomous driving. However, current multi-modal approaches still follow a "local fusion to communication" sequence: each agent fuses its multi-modal data locally and then needs high bandwidth to transmit its feature data before collaborative fusion. EIMC instead proposes an early collaborative paradigm. It injects lightweight collaborative voxels, transmitted by neighbor agents, into the ego's local modality-fusion step, yielding compact yet informative 3D collaborative priors that tighten cross-modal alignment. Next, a heatmap-driven consensus protocol identifies exactly where cooperation is needed by computing per-pixel confidence heatmaps. Only the top-K instance vectors located in these low-confidence, high-discrepancy regions are queried from peers and then fused via cross-attention for completion. Afterwards, a refinement fusion collects the top-K most confident instances from each agent and enhances their features using self-attention. This instance-centric messaging reduces redundancy while guaranteeing that critical occluded objects are recovered. Evaluated on OPV2V and DAIR-V2X, EIMC attains 73.01% AP@0.5 while reducing byte bandwidth usage by 87.98% compared with the best published multi-modal collaborative detector. Code is publicly released at https://github.com/sidiangongyuan/EIMC.
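
The query-selection step described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `select_query_cells` and the scoring rule (ego uncertainty multiplied by the absolute ego/peer disagreement) are assumptions chosen only to make "low-confidence, high-discrepancy" concrete. Heatmaps are plain nested lists of per-pixel confidences in [0, 1].

```python
import heapq


def select_query_cells(ego_heatmap, peer_heatmap, k):
    """Return up to k (row, col) cells the ego would query from a peer.

    Assumed score (hypothetical, not taken from the paper):
        (1 - ego_conf) * |peer_conf - ego_conf|
    so a cell ranks high only if the ego is unsure AND the peer disagrees.
    """
    scored = []
    for r, (ego_row, peer_row) in enumerate(zip(ego_heatmap, peer_heatmap)):
        for c, (ego_conf, peer_conf) in enumerate(zip(ego_row, peer_row)):
            score = (1.0 - ego_conf) * abs(peer_conf - ego_conf)
            scored.append((score, r, c))
    # Keep only the k highest-scoring cells, best first.
    top = heapq.nlargest(k, scored, key=lambda item: item[0])
    return [(r, c) for _, r, c in top]


# Toy 2x2 heatmaps: the ego is unsure about cell (0, 1), the peer is confident there.
ego = [[0.9, 0.1], [0.2, 0.8]]
peer = [[0.9, 0.95], [0.25, 0.1]]
cells = select_query_cells(ego, peer, k=1)
```

In this toy case only one cell is requested, so the ego would ask the peer for a single instance vector rather than a full feature map, which is the intuition behind the bandwidth savings claimed in the abstract.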

Paper Structure

This paper contains 25 sections, 18 equations, 8 figures, 7 tables.

Figures (8)

  • Figure 1: Compared with other intermediate fusion methods, EIMC achieves lower communication volume while still attaining the best performance. The $\text{Ours}^\dagger$ variant is the version without the Mix-Voxel module. BM2CP [bm2cp] is a multi-modal collaborative perception method.
  • Figure 2: Framework. Given LiDAR and camera inputs, our method first extracts heterogeneous features through dedicated modality-specific encoders. The Mix-Voxel (MV) module uses lightweight voxel transmission as priors to build the collaborative voxel and then constructs occupancy-guided voxel-based image representations, which are compressed into BEV features and fused with LiDAR BEV features through Heterogeneous Modality Fusion (HMF). The Instance Completion (IC) and Instance Refinement (IR) modules subsequently propagate instance-level messages identified from heatmap priors. The collaboration employs multi-scale features for final detection, with predicted and ground-truth bounding boxes visualized as green and red boxes, respectively.
  • Figure 3: Mix-Voxel module constructs a local graph of voxels, utilizing self-attention mechanisms to facilitate information exchange.
  • Figure 4: Heterogeneous Modality Fusion module. The module consists of two streams. The left part uses an attention-based method to establish the interaction between the two modalities, while the right part employs a basic operation (e.g., concatenation) followed by convolutional layers to fuse the modalities.
  • Figure 5: Instance-level communication. The Instance Completion (IC) module prioritizes critical regions by analyzing cross-agent heatmap discrepancies, performing instance completion via cross-attention. The IR module first selects agent-specific instances from heatmaps, then refines them via self-attention. Finally, it aggregates instance-to-scene context by cross-attending BEV features (query) to instance representations (key/value).
  • ...and 3 more figures
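
The final aggregation step in the Figure 5 caption, where BEV features act as queries and instance representations act as keys and values, can be sketched as plain scaled dot-product cross-attention. This is a simplified illustration under stated assumptions, not the paper's module: the function names are hypothetical, there is a single head, and the learned query/key/value projections that a real attention layer would have are omitted.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, keys, values):
    """Single-head scaled dot-product cross-attention (no learned projections).

    queries: list of BEV cell feature vectors (assumed role: query)
    keys, values: lists of instance vectors (assumed role: key/value)
    Returns one attended vector per query.
    """
    dim = len(queries[0])
    scale = 1.0 / math.sqrt(dim)
    outputs = []
    for q in queries:
        # Similarity of this BEV cell to every instance vector.
        logits = [scale * sum(qi * ki for qi, ki in zip(q, k)) for k in keys]
        weights = softmax(logits)
        # Weighted sum of instance values.
        attended = [
            sum(w * v[d] for w, v in zip(weights, values))
            for d in range(len(values[0]))
        ]
        outputs.append(attended)
    return outputs


# Two toy instance vectors; the BEV query aligns strongly with the first one.
instance_keys = [[10.0, 0.0], [-10.0, 0.0]]
instance_values = [[1.0, 2.0], [3.0, 4.0]]
bev_queries = [[10.0, 0.0]]
context = cross_attend(bev_queries, instance_keys, instance_values)
```

Here the query matches the first instance far better than the second, so `context[0]` is essentially the first instance's value vector. When all keys are identical, the weights become uniform and the output reduces to the mean of the values.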