
Multi-Modal Sensor Fusion using Hybrid Attention for Autonomous Driving

Mayank Mayank, Bharanidhar Duraisamy, Florian Geiß, Abhinav Valada

Abstract

Accurate 3D object detection for autonomous driving requires complementary sensors. Cameras provide dense semantics but unreliable depth, while millimeter-wave radar offers precise range and velocity measurements with sparse geometry. We propose MMF-BEV, a radar-camera BEV fusion framework that leverages deformable attention for cross-modal feature alignment on the View-of-Delft (VoD) 4D radar dataset [1]. MMF-BEV combines a BEVDepth [2] camera branch with a RadarBEVNet [3] radar branch, each enhanced with Deformable Self-Attention, and fuses them via a Deformable Cross-Attention module. We evaluate three configurations: camera-only, radar-only, and hybrid fusion. A sensor contribution analysis quantifies per-distance modality weighting, providing interpretable evidence of sensor complementarity. A two-stage training strategy, which pre-trains the camera branch with depth supervision and then jointly trains the radar and fusion modules, stabilizes learning. Experiments on VoD show that MMF-BEV consistently outperforms unimodal baselines and achieves competitive results against prior fusion methods across all object classes in both the full annotated area and the near-range Region of Interest.
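To make the fusion flow described in the abstract concrete, the following is a minimal PyTorch sketch of a bidirectional cross-attention BEV fusion stage. The module name CrossModalBEVFusion, the tensor shapes, the channel and head sizes, the interpretation of CBR as Conv-BN-ReLU, and the use of standard multi-head attention as a simplified stand-in for the paper's deformable self- and cross-attention are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the MMF-BEV fusion stage described in the abstract.
# Assumptions (not from the paper): tensor shapes, channel sizes, CBR = Conv-BN-ReLU,
# and standard multi-head attention as a stand-in for deformable self/cross-attention.
import torch
import torch.nn as nn


class CrossModalBEVFusion(nn.Module):
    """Fuse camera and radar BEV features via bidirectional cross-attention,
    then concatenate and reduce with a Conv-BN-ReLU (CBR) block."""

    def __init__(self, channels: int = 256, heads: int = 8):
        super().__init__()
        # Stand-ins for Deformable Self-Attention (one per modality).
        self.cam_self_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.rad_self_attn = nn.MultiheadAttention(channels, heads, batch_first=True)
        # Stand-ins for Deformable Cross-Attention (camera->radar and radar->camera).
        self.cam_from_rad = nn.MultiheadAttention(channels, heads, batch_first=True)
        self.rad_from_cam = nn.MultiheadAttention(channels, heads, batch_first=True)
        # CBR fusion layer applied to the concatenated cross-attended features.
        self.cbr = nn.Sequential(
            nn.Conv2d(2 * channels, channels, kernel_size=3, padding=1, bias=False),
            nn.BatchNorm2d(channels),
            nn.ReLU(inplace=True),
        )

    def forward(self, cam_bev: torch.Tensor, rad_bev: torch.Tensor) -> torch.Tensor:
        # cam_bev, rad_bev: (B, C, H, W) BEV feature maps from the two branches.
        b, c, h, w = cam_bev.shape
        cam = cam_bev.flatten(2).transpose(1, 2)  # (B, H*W, C) token sequence
        rad = rad_bev.flatten(2).transpose(1, 2)

        # Per-modality self-attention refinement with residual connections.
        cam = cam + self.cam_self_attn(cam, cam, cam, need_weights=False)[0]
        rad = rad + self.rad_self_attn(rad, rad, rad, need_weights=False)[0]

        # Bidirectional cross-attention: each modality queries the other.
        cam_x = cam + self.cam_from_rad(cam, rad, rad, need_weights=False)[0]
        rad_x = rad + self.rad_from_cam(rad, cam, cam, need_weights=False)[0]

        # Back to (B, C, H, W), concatenate along channels, fuse with CBR.
        cam_x = cam_x.transpose(1, 2).reshape(b, c, h, w)
        rad_x = rad_x.transpose(1, 2).reshape(b, c, h, w)
        return self.cbr(torch.cat([cam_x, rad_x], dim=1))


if __name__ == "__main__":
    fusion = CrossModalBEVFusion(channels=64, heads=4)
    cam_bev = torch.randn(1, 64, 32, 32)   # toy camera BEV features
    rad_bev = torch.randn(1, 64, 32, 32)   # toy radar BEV features
    fused = fusion(cam_bev, rad_bev)
    print(fused.shape)  # torch.Size([1, 64, 32, 32])
```

The residual connections and the concatenate-then-CBR reduction mirror the fusion flow sketched in the abstract and in Figure 4; a faithful implementation would replace the attention stand-ins with deformable attention that samples features at learned offsets.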



Figures (8)

  • Figure 1: BEVDepth framework [2].
  • Figure 2: Architecture of RCBEVDet's dual-stream radar backbone (Point-based Block and Transformer-based Block) [3].
  • Figure 3: Overall pipeline of MMF-BEV. Front-view camera images are transformed into BEV features using a BEVDepth-based camera branch, while 4D radar point clouds are encoded via RadarBEVNet to produce radar BEV representations. Both modalities are refined with Deformable Self-Attention and fused through a Multi-Layer Hybrid Fusion module in BEV space. The fused multi-modal BEV features are used for 3D object detection together with sensor-specific confidence scores for the detections.
  • Figure 4: Multi-Layer Hybrid Fusion module. After per-modality DSA refinement, camera features and radar features exchange information through deformable cross-attention (top and bottom), with the outputs concatenated and passed through CBR fusion layers.
  • Figure 5: Qualitative comparison of intermediate BEV feature representations for VoD validation scene ID 00000.
  • ...and 3 more figures