CoBEVMoE: Heterogeneity-aware Feature Fusion with Dynamic Mixture-of-Experts for Collaborative Perception

Lingzhao Kong, Jiacheng Lin, Siyu Li, Kai Luo, Zhiyong Li, Kailun Yang

TL;DR

CoBEVMoE tackles heterogeneity in multi-agent collaborative perception with two components: a Dynamic Mixture-of-Experts (DMoE) that generates agent-conditioned kernels, and a Dynamic Expert Metric Loss (DEML) that preserves expert diversity. A self-attention backbone provides the initial fusion of BEV features, the dynamic kernels and a gating mechanism fuse the agent-specific features, and the final representation is enriched by a residual MoE fusion. On OPV2V and DAIR-V2X-C the method reports state-of-the-art results: +1.5% IoU for camera-based BEV semantic segmentation on OPV2V and +3.0% AP@0.5 for LiDAR-based 3D object detection on DAIR-V2X-C, along with higher inter-expert diversity. These results indicate that explicitly modeling inter-agent diversity and adapting to each agent’s viewpoint yields more informative fused representations, with practical implications for safer, more reliable autonomous driving in networked scenarios.

Abstract

Collaborative perception aims to extend sensing coverage and improve perception accuracy by sharing information among multiple agents. However, due to differences in viewpoints and spatial positions, agents often acquire heterogeneous observations. Existing intermediate fusion methods primarily focus on aligning similar features, often overlooking the perceptual diversity among agents. To address this limitation, we propose CoBEVMoE, a novel collaborative perception framework that operates in the Bird's Eye View (BEV) space and incorporates a Dynamic Mixture-of-Experts (DMoE) architecture. In DMoE, each expert is dynamically generated based on the input features of a specific agent, enabling it to extract distinctive and reliable cues while attending to shared semantics. This design allows the fusion process to explicitly model both feature similarity and heterogeneity across agents. Furthermore, we introduce a Dynamic Expert Metric Loss (DEML) to enhance inter-expert diversity and improve the discriminability of the fused representation. Extensive experiments on the OPV2V and DAIR-V2X-C datasets demonstrate that CoBEVMoE achieves state-of-the-art performance. Specifically, it improves the IoU for camera-based BEV segmentation by +1.5% on OPV2V and the AP@0.5 for LiDAR-based 3D object detection by +3.0% on DAIR-V2X-C, verifying the effectiveness of expert-based heterogeneous feature modeling in multi-agent collaborative perception. The source code will be made publicly available at https://github.com/godk0509/CoBEVMoE.
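
The abstract does not give the DMoE equations, so the following is only a minimal NumPy sketch of the general idea, under stated assumptions. Each agent's aligned BEV feature map is globally pooled, a hypothetical linear generator (`w_gen`) turns the pooled vector into a 1×1 kernel that acts as that agent's expert, and a softmax gate mixes the expert outputs per BEV cell. A plain mean over agents stands in for the attention-based initial fusion, and the MoE output is added to it residually. The function name, parameter names, and shapes are illustrative and are not taken from the paper or its repository.

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def dmoe_fuse(feats, w_gen, w_gate):
    """Illustrative dynamic mixture-of-experts fusion (assumed form).

    feats:  (N, C, H, W) spatially aligned BEV features from N agents.
    w_gen:  (C*C, C) hypothetical generator weights; maps a pooled agent
            descriptor to a 1x1 convolution kernel of shape (C, C).
    w_gate: (C,) hypothetical gating vector scoring each expert output.
    Returns the fused map (C, H, W) and the gate weights (N, H, W).
    """
    n, c, h, w = feats.shape

    # One descriptor per agent via global average pooling.
    pooled = feats.mean(axis=(2, 3))                      # (N, C)

    # Agent-conditioned kernels: one 1x1 kernel (expert) per agent.
    kernels = (pooled @ w_gen.T).reshape(n, c, c)         # (N, C, C)

    # Apply each expert to its own agent's feature map.
    expert_out = np.einsum("nij,njhw->nihw", kernels, feats)

    # Per-cell gate over agents, normalised with a softmax.
    logits = np.einsum("c,nchw->nhw", w_gate, expert_out)
    gates = softmax(logits, axis=0)                       # (N, H, W)

    moe = (gates[:, None] * expert_out).sum(axis=0)       # (C, H, W)

    # Assumption: a mean over agents replaces the attention-based
    # initial fusion; the MoE output is added as a residual.
    base = feats.mean(axis=0)
    return base + moe, gates


rng = np.random.default_rng(0)
feats = rng.standard_normal((3, 8, 4, 4))     # 3 agents, 8 channels, 4x4 grid
w_gen = rng.standard_normal((64, 8)) * 0.1
w_gate = rng.standard_normal(8)
fused, gates = dmoe_fuse(feats, w_gen, w_gate)
print(fused.shape, gates.shape)               # (8, 4, 4) (3, 4, 4)
```

Because the kernels are computed from the features themselves, two agents with different viewpoints get different experts even though they share the generator weights, which is the property the abstract attributes to DMoE.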

Paper Structure

This paper contains 22 sections, 19 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: (a) Conventional attention-based fusion suppresses agent-specific cues, causing missed perception (red box). Our DMoE-based fusion preserves heterogeneous features and correctly segments the target. (b) Quantitative comparison on OPV2V BEV semantic segmentation and DAIR-V2X-C 3D object detection, showing consistent IoU and AP improvements.
  • Figure 2: Illustration of the proposed CoBEVMoE framework. Each agent first extracts BEV features from its raw sensory inputs, which are then spatially aligned and transmitted to a central aggregator. A Dynamic Mixture-of-Experts (DMoE) module generates agent-specific expert kernels and performs adaptive feature aggregation through a learned gating mechanism. The resulting fused representation is subsequently decoded for downstream perception tasks, including semantic segmentation and 3D object detection. To encourage inter-expert diversity while maintaining consistency with the fused feature, we introduce a Dynamic Expert Metric Loss (DEML), denoted $\mathcal{L}_{\text{DEML}}$ in the figure. This loss promotes diverse yet coherent expert representations, ultimately enhancing the quality of the final fusion.
  • Figure 3: Qualitative comparison among different expert feature maps. (a) shows the ground-truth BEV semantic map and (b) the predicted BEV semantic map, while (c–f) depict the activation maps of four different experts in our DMoE module. The four experts focus on different spatial patterns, indicating the diversity in feature extraction. This complementary behavior contributes to a more comprehensive and robust feature fusion in the downstream task.
  • Figure 4: Qualitative results of collaborative BEV segmentation. From left to right: (a) ego vehicle's front camera image, (b–c) front camera images from two collaborating agents (cav1 and cav2), (d) ground truth BEV segmentation, (e) result of the baseline fusion method, and (f) result of our proposed CoBEVMoE. Compared to the baseline, our method produces more complete and precise semantic segmentation, especially for distant or partially occluded targets, demonstrating its ability to effectively integrate complementary views and retain agent-specific information.
  • Figure 5: Visualization of BEV feature maps before and after the DMoE fusion module. (a) shows the ground truth BEV semantic map. (b) shows the predicted BEV semantic map. (c) presents the feature activation of a representative channel before DMoE fusion, while (d) displays the output after fusion. The DMoE-enhanced feature map demonstrates clearer and more focused activations, indicating that our expert-based fusion improves feature quality and semantic alignment.
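
The Figure 2 caption describes DEML only qualitatively: it should keep experts diverse while keeping them consistent with the fused feature. The exact loss is not given in this section, so the sketch below is a generic stand-in rather than the authors' formulation. It assumes cosine similarity as the metric, a "pull" term that keeps each expert close to the fused feature, and a hinge "push" term that penalises pairs of experts whose similarity exceeds a hypothetical `margin`.

```python
import numpy as np


def cosine(a, b, eps=1e-8):
    """Cosine similarity between two 1-D vectors."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + eps))


def diversity_consistency_loss(experts, fused, margin=0.5):
    """Generic diversity-plus-consistency loss (assumed form, not DEML itself).

    experts: (N, D) flattened outputs of N experts.
    fused:   (D,) flattened fused feature.
    margin:  hypothetical similarity level above which expert pairs are penalised.
    """
    n = experts.shape[0]

    # Consistency: every expert should stay aligned with the fused feature.
    pull = np.mean([1.0 - cosine(e, fused) for e in experts])

    # Diversity: penalise expert pairs that are more similar than the margin.
    pairs = [(i, j) for i in range(n) for j in range(i + 1, n)]
    if pairs:
        push = np.mean(
            [max(0.0, cosine(experts[i], experts[j]) - margin) for i, j in pairs]
        )
    else:
        push = 0.0

    return float(pull + push)


fused_vec = np.array([1.0, 1.0])
collapsed = np.array([[1.0, 1.0], [1.0, 1.0]])   # both experts identical
diverse = np.array([[1.0, 0.0], [0.0, 1.0]])     # experts are orthogonal

# Under the default margin, the collapsed experts incur the larger loss.
print(diversity_consistency_loss(collapsed, fused_vec))
print(diversity_consistency_loss(diverse, fused_vec))
```

In this toy case the identical experts are perfectly consistent with the fused feature but pay the full diversity penalty, while the orthogonal experts pay only a moderate consistency cost, which illustrates the trade-off the caption describes.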