MIC-BEV: Multi-Infrastructure Camera Bird's-Eye-View Transformer with Relation-Aware Fusion for 3D Object Detection

Yun Zhang, Zhaoliang Zheng, Johnson Liu, Zhiyu Huang, Zewei Zhou, Zonglin Meng, Tianhui Cai, Jiaqi Ma

TL;DR

MIC-BEV tackles infrastructure-based multi-camera 3D object detection with heterogeneous extrinsic and intrinsic parameters by introducing a Transformer-based BEV framework that explicitly models camera-BEV geometry through a Graph Attention Network. The model fuses multi-view features via Relation-Enhanced Spatial Cross-Attention, leveraging a bipartite camera-BEV graph and deformable attention over 3D reference points, while also learning map-level and object-level BEV segmentation priors. To enable robust evaluation across diverse deployments, the authors introduce the M2I synthetic dataset and validate on RoScenes, achieving state-of-the-art performance (e.g., mAP $=0.767$, NDS $=0.771$ on M2I-Normal) and strong robustness under sensor degradation and adverse weather. The work advances infrastructure sensing by delivering interpretable view weighting, strong generalization to varied camera layouts, and practical deployment potential for traffic monitoring and cooperative autonomy.

Abstract

Infrastructure-based perception plays a crucial role in intelligent transportation systems, offering global situational awareness and enabling cooperative autonomy. However, existing camera-based detection models often underperform in such scenarios due to challenges such as multi-view infrastructure setup, diverse camera configurations, degraded visual inputs, and various road layouts. We introduce MIC-BEV, a Transformer-based bird's-eye-view (BEV) perception framework for infrastructure-based multi-camera 3D object detection. MIC-BEV flexibly supports a variable number of cameras with heterogeneous intrinsic and extrinsic parameters and demonstrates strong robustness under sensor degradation. The proposed graph-enhanced fusion module in MIC-BEV integrates multi-view image features into the BEV space by exploiting geometric relationships between cameras and BEV cells alongside latent visual cues. To support training and evaluation, we introduce M2I, a synthetic dataset for infrastructure-based object detection, featuring diverse camera configurations, road layouts, and environmental conditions. Extensive experiments on both M2I and the real-world dataset RoScenes demonstrate that MIC-BEV achieves state-of-the-art performance in 3D object detection. It also remains robust under challenging conditions, including extreme weather and sensor degradation. These results highlight the potential of MIC-BEV for real-world deployment. The dataset and source code are available at: https://github.com/HandsomeYun/MIC-BEV.
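
To make the idea of "geometric relationships between cameras and BEV cells" concrete, the following is a minimal sketch, not the authors' implementation. It computes the three relations named in the Figure 1 caption (distance, angle, height difference) for each camera-cell pair and turns per-camera scores into normalized view weights. The function names, the camera dictionary layout, and the distance-based scoring rule are all assumptions made for illustration; in MIC-BEV the weights come from a learned GAT, which is replaced here by a fixed hand-written score.

```python
import math

def edge_features(cam_xyz, cam_yaw, cell_xy, ground_z=0.0):
    """Hypothetical helper: geometric relation between one camera and one BEV cell.

    Returns (horizontal distance, angle relative to camera heading, height difference).
    """
    dx = cell_xy[0] - cam_xyz[0]
    dy = cell_xy[1] - cam_xyz[1]
    distance = math.hypot(dx, dy)            # horizontal range from camera to cell
    bearing = math.atan2(dy, dx)             # world-frame direction camera -> cell
    # Wrap the bearing relative to the camera heading into [-pi, pi).
    rel_angle = (bearing - cam_yaw + math.pi) % (2 * math.pi) - math.pi
    height_diff = cam_xyz[2] - ground_z      # camera mounting height above the cell
    return distance, rel_angle, height_diff

def view_weights(scores):
    """Numerically stable softmax over per-camera scores for a single BEV cell."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Two illustrative cameras facing each other across an intersection (made-up values).
cameras = [
    {"xyz": (0.0, 0.0, 6.0), "yaw": 0.0},
    {"xyz": (40.0, 0.0, 8.0), "yaw": math.pi},
]
cell = (10.0, 0.0)

feats = [edge_features(c["xyz"], c["yaw"], cell) for c in cameras]
# Assumption: a fixed score that favors nearer cameras stands in for the learned GAT.
scores = [-f[0] / 10.0 for f in feats]
weights = view_weights(scores)
```

In this toy setup the nearer camera receives the larger weight. A learned model could instead weigh angle, height, and visual cues jointly, which is the role the paper assigns to its graph-enhanced fusion module.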

Paper Structure

This paper contains 31 sections, 13 equations, 8 figures, 12 tables, 2 algorithms.

Figures (8)

  • Figure 1: Overview of the proposed MIC-BEV framework for multi-camera infrastructure perception. (a) Example scenarios of infrastructure sensing with diverse road geometry and camera configuration. (b) Multiple cameras capture the scene from different viewpoints and project objects onto a defined BEV grid. (c) MIC-BEV encodes the geometric relations between each camera and BEV cell, such as distances, angles, and height differences, to construct edge features for a graph neural network (GNN). (d) The GNN in MIC-BEV performs geometric relation-aware fusion, assigning importance weights to each camera for adaptive multi-view feature aggregation.
  • Figure 2: Overview of the MIC-BEV architecture. The model takes multi-view images from a variable number of infrastructure-mounted cameras as input and extracts features through a shared backbone. A camera masking module applies random dropout or Gaussian noise to simulate degraded views. The extracted features are fused into a BEV representation via Transformer layers with temporal self-attention and our proposed Relation-Enhanced Spatial Cross-Attention. GAT networks are used to dynamically assign view-dependent weights based on camera node features and geometric relations between the camera and its visible BEV cells. The resulting BEV features are used for both object detection and BEV semantic segmentation tasks.
  • Figure 3: Scene diversity in the M2I dataset. (a) Examples of camera placement configurations across various intersection types, ranging from single-camera highway ramps to multi-camera urban intersections. (b) Diverse weather and illumination conditions, including rain, fog, dusk, and night scenes, highlighting domain variability. (c) Different levels of image blur representing sensor degradation from light to heavy. These variations illustrate the broad coverage and realism of M2I for infrastructure perception.
  • Figure 4: M2I Dataset Composition. (a) Scene composition showing splits by weather, lighting, road geometry, and camera count. (b) Agent composition illustrating class proportions, velocity profiles, and footprint areas across object types. (c) Camera composition showing histograms of camera height, pitch, and yaw across all scenes. These statistics demonstrate the diversity and balance of the M2I dataset for multi-camera perception.
  • Figure 5: Qualitative 3D detection results on the M2I dataset. MIC-BEV accurately detects multiple object types across diverse urban scenes with varying traffic densities and weather or lighting conditions.
  • ...and 3 more figures
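
The camera masking module mentioned in the Figure 2 caption (random dropout or Gaussian noise to simulate degraded views) can be sketched as follows. This is an illustrative approximation under stated assumptions, not the paper's code. The function name `mask_views`, the probabilities `p_drop` and `p_noise`, and the noise scale `sigma` are hypothetical. Each view is represented as a flat list of floats standing in for its pixels or features, since the caption does not specify the exact tensor the module operates on.

```python
import random

def mask_views(views, p_drop=0.2, p_noise=0.2, sigma=0.1, rng=None):
    """Hypothetical camera masking: per view, drop it, add Gaussian noise, or keep it.

    views: list of views, each a flat list of floats (stand-in for pixels or features).
    """
    rng = rng or random.Random()
    masked = []
    for view in views:
        r = rng.random()                      # one draw decides this view's fate
        if r < p_drop:
            masked.append([0.0] * len(view))  # simulate a failed or occluded camera
        elif r < p_drop + p_noise:
            # Simulate a degraded sensor with additive zero-mean Gaussian noise.
            masked.append([v + rng.gauss(0.0, sigma) for v in view])
        else:
            masked.append(list(view))         # leave the view unchanged (copied)
    return masked

# Three toy views with made-up values; a seeded generator keeps runs repeatable.
views = [[0.5, 0.1, 0.9], [0.3, 0.3, 0.3], [1.0, 0.0, 0.2]]
masked = mask_views(views, rng=random.Random(0))
```

Applying such masking only during training is a common way to encourage a fusion model to tolerate missing or noisy cameras. Whether MIC-BEV uses these exact probabilities or noise levels is not stated in the text above.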