MIC-BEV: Multi-Infrastructure Camera Bird's-Eye-View Transformer with Relation-Aware Fusion for 3D Object Detection
Yun Zhang, Zhaoliang Zheng, Johnson Liu, Zhiyu Huang, Zewei Zhou, Zonglin Meng, Tianhui Cai, Jiaqi Ma
TL;DR
MIC-BEV addresses infrastructure-based multi-camera 3D object detection with heterogeneous intrinsic and extrinsic camera parameters by introducing a Transformer-based BEV framework that explicitly models camera-BEV geometry through a Graph Attention Network. The model fuses multi-view features via Relation-Enhanced Spatial Cross-Attention, which combines a bipartite camera-BEV graph with deformable attention over 3D reference points, and also learns map-level and object-level BEV segmentation priors. To enable evaluation across diverse deployments, the authors introduce the M2I synthetic dataset and also validate on the real-world RoScenes dataset, reporting state-of-the-art performance (e.g., mAP $=0.767$, NDS $=0.771$ on M2I-Normal) and robustness under sensor degradation and adverse weather. The work advances infrastructure sensing by offering interpretable view weighting, generalization to varied camera layouts, and practical deployment potential for traffic monitoring and cooperative autonomy.
Abstract
Infrastructure-based perception plays a crucial role in intelligent transportation systems, offering global situational awareness and enabling cooperative autonomy. However, existing camera-based detection models often underperform in such scenarios because of multi-view infrastructure setups, diverse camera configurations, degraded visual inputs, and varied road layouts. We introduce MIC-BEV, a Transformer-based bird's-eye-view (BEV) perception framework for infrastructure-based multi-camera 3D object detection. MIC-BEV flexibly supports a variable number of cameras with heterogeneous intrinsic and extrinsic parameters and demonstrates strong robustness under sensor degradation. The proposed graph-enhanced fusion module in MIC-BEV integrates multi-view image features into the BEV space by exploiting geometric relationships between cameras and BEV cells alongside latent visual cues. To support training and evaluation, we introduce M2I, a synthetic dataset for infrastructure-based object detection, featuring diverse camera configurations, road layouts, and environmental conditions. Extensive experiments on both M2I and the real-world dataset RoScenes demonstrate that MIC-BEV achieves state-of-the-art performance in 3D object detection. It also remains robust under challenging conditions, including extreme weather and sensor degradation. These results highlight the potential of MIC-BEV for real-world deployment. The dataset and source code are available at: https://github.com/HandsomeYun/MIC-BEV.
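To make the phrase "geometric relationships between cameras and BEV cells" concrete, the sketch below projects 3D reference points into several cameras that each have their own intrinsics and extrinsics, and records which camera sees which point. It is only an illustration of the standard pinhole projection that such a camera-BEV graph could be built on, not the authors' implementation: the function names `project_points` and `camera_bev_edges`, the camera dictionary layout, and the use of a plain visibility mask as graph edges are all assumptions made here. In MIC-BEV the relation weights are learned by the graph-enhanced fusion module rather than fixed by a mask.

```python
import numpy as np

def project_points(points_world, K, T_world_to_cam, image_size):
    """Project N world-frame points into one pinhole camera (hypothetical helper).

    points_world: (N, 3) array; K: (3, 3) intrinsics;
    T_world_to_cam: (4, 4) extrinsics; image_size: (width, height).
    Returns pixel coordinates (N, 2) and a boolean visibility mask (N,).
    """
    n = points_world.shape[0]
    homo = np.hstack([points_world, np.ones((n, 1))])   # homogeneous coords, (N, 4)
    cam = (T_world_to_cam @ homo.T).T[:, :3]            # points in camera frame
    depth = cam[:, 2]
    in_front = depth > 1e-6                             # discard points behind the camera
    safe_depth = np.where(in_front, depth, 1.0)         # avoid division by zero
    uv = (K @ cam.T).T[:, :2] / safe_depth[:, None]     # perspective division
    width, height = image_size
    inside = (uv[:, 0] >= 0) & (uv[:, 0] < width) & (uv[:, 1] >= 0) & (uv[:, 1] < height)
    return uv, in_front & inside

def camera_bev_edges(points_world, cameras):
    """Boolean (num_cameras, N) mask: entry [c, i] is True if camera c sees point i.

    Treating this mask as the edge set of a bipartite camera-BEV graph is an
    assumption of this sketch, not a detail confirmed by the paper.
    """
    return np.stack([
        project_points(points_world, c["K"], c["T"], c["size"])[1] for c in cameras
    ])

# Two cameras with different intrinsics and image sizes, both at the world origin.
cameras = [
    {"K": np.array([[100.0, 0, 50], [0, 100.0, 50], [0, 0, 1]]), "T": np.eye(4), "size": (100, 100)},
    {"K": np.array([[50.0, 0, 100], [0, 50.0, 100], [0, 0, 1]]), "T": np.eye(4), "size": (200, 200)},
]
points = np.array([[0.0, 0.0, 5.0], [0.0, 0.0, -5.0], [4.0, 0.0, 5.0]])
edges = camera_bev_edges(points, cameras)
```

Because the mask has one row per camera, the same code handles any number of cameras, and a point seen by no camera simply has an all-False column.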
