M^3Detection: Multi-Frame Multi-Level Feature Fusion for Multi-Modal 3D Object Detection with Camera and 4D Imaging Radar

Xiaozhi Li, Huijun Di, Jian Li, Feng Liu, Wei Liang

TL;DR

M^3Detection tackles robust 3D object detection for autonomous driving by fusing camera data with 4D imaging radar over multiple frames. The framework reuses intermediate features from a single-frame baseline detector and uses tracker-generated reference trajectories to drive multi-frame, multi-level fusion. Fusion is carried out by three modules within a memory-bank-based two-stage setup: global-level inter-object feature aggregation (GOA), local-level inter-grid feature aggregation (LGA), and trajectory-level multi-frame spatiotemporal reasoning (MSTR). Results on VoD and TJ4DRadSet show state-of-the-art performance, with gains in both 3D and BEV metrics across challenging conditions. Efficiency is preserved because the second stage avoids redundant feature re-extraction. The approach offers a plug-in strategy for improving existing camera-radar systems in adverse weather and dynamic scenes.

Abstract

Recent advances in 4D imaging radar have enabled robust perception in adverse weather, while camera sensors provide dense semantic information. Fusing these complementary modalities has great potential for cost-effective 3D perception. However, most existing camera-radar fusion methods are limited to single-frame inputs, capturing only a partial view of the scene. The incomplete scene information, compounded by image degradation and 4D radar sparsity, hinders overall detection performance. In contrast, multi-frame fusion offers richer spatiotemporal information but faces two challenges: achieving robust and effective object feature fusion across frames and modalities, and mitigating the computational cost of redundant feature extraction. To address these challenges, we propose M^3Detection, a unified multi-frame 3D object detection framework that performs multi-level feature fusion on multi-modal data from camera and 4D imaging radar. Our framework leverages intermediate features from the baseline detector and employs a tracker to produce reference trajectories, improving computational efficiency and providing richer information for the second stage. In the second stage, we design a global-level inter-object feature aggregation module, guided by radar information, to align global features across candidate proposals, and a local-level inter-grid feature aggregation module that expands local features along the reference trajectories to enhance fine-grained object representation. The aggregated features are then processed by a trajectory-level multi-frame spatiotemporal reasoning module to encode cross-frame interactions and enhance temporal representation. Extensive experiments on the VoD and TJ4DRadSet datasets demonstrate that M^3Detection achieves state-of-the-art 3D detection performance, validating its effectiveness in multi-frame detection with camera-4D imaging radar fusion.
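
The idea of reusing first-stage outputs rather than re-extracting features can be illustrated with a small sketch. The code below is not the authors' implementation: `FrameRecord`, `MemoryBank`, the `max_frames` window, and the per-track dictionary layout are all hypothetical names and choices made for illustration. It only shows how per-frame features and tracked boxes might be cached once and later read back along a trajectory.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class FrameRecord:
    """Cached first-stage outputs for one frame (hypothetical layout)."""
    frame_id: int
    features: dict  # track_id -> feature vector cached from the baseline detector
    boxes: dict     # track_id -> box parameters from the initial detection


class MemoryBank:
    """Fixed-length window of past frames; the window size is an assumption."""

    def __init__(self, max_frames=4):
        # deque with maxlen silently drops the oldest frame when full
        self.frames = deque(maxlen=max_frames)

    def push(self, record):
        self.frames.append(record)

    def trajectory(self, track_id):
        """Return (frame_id, feature, box) tuples for one track, oldest first.

        Frames in which the tracker did not observe the track are skipped.
        """
        return [
            (r.frame_id, r.features[track_id], r.boxes[track_id])
            for r in self.frames
            if track_id in r.features
        ]
```

A second stage built this way would read `bank.trajectory(track_id)` for each tracked object instead of running a feature extractor over the accumulated trajectory data again.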

Paper Structure

This paper contains 24 sections, 14 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: Comparison of multi-frame 3D detection frameworks. Single-stage methods apply LSTM or Transformer networks to fuse scene-level information across frames. Two-stage methods generate initial detections using a baseline detector and refine them by re-extracting features from trajectory point clouds. In contrast, our method eliminates redundant feature re-extraction and performs multi-frame multi-level feature fusion on multi-modal intermediate features, achieving enhanced performance while maintaining computational efficiency.
  • Figure 2: The overall architecture of M^3Detection. The framework is a two-stage pipeline for multi-frame 3D detection with camera and 4D radar. In the first stage, a single-frame multi-modal baseline detector extracts local and global intermediate features and generates initial detection results, which a tracking module associates into reference trajectories and candidate proposals. These features and trajectories are stored in a memory bank for the second stage, where multi-frame feature aggregation and spatiotemporal reasoning are performed. GOA aligns global features across candidate proposal positions to improve recall while preserving precision, LGA expands the crop region around reference trajectory positions and leverages cross-level deformable attention to capture richer local context, and MSTR employs multi-head attention to enable trajectory-level spatiotemporal interactions across frames. Finally, fused global and local features are used for bounding box and category regression.
  • Figure 3: Multi-level feature aggregation based on reference trajectory and candidate proposals. GOA aligns and integrates features at candidate proposal positions to increase recall while maintaining feature precision. LGA performs crop region expansion and cross-level deformable attention along the reference trajectory to enrich local context and enlarge the receptive field.
  • Figure 4: Spatial correspondence between the initial detections and BEV indexes. Each 3D bounding box is spatially aligned with the global BEV representation during single-frame inference, enabling its associated feature values to be directly retrieved using the recorded BEV indexes.
  • Figure 5: Global-level inter-object feature aggregation (GOA). Based on candidate proposals, BEV features are aggregated from candidate positions to enrich the representation. Feature reliability is ensured by leveraging radar-aware features, spatial distribution of candidate proposals, and an adaptive weight matrix for robust aggregation.
  • ...and 4 more figures
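
The index-based retrieval described in the Figure 4 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's code: the function names, the grid origin (`x_min`, `y_min`), `cell_size`, and `stride` are hypothetical parameters, and the BEV map is modeled as a plain nested list so the example has no dependencies.

```python
import math


def bev_index(x, y, x_min, y_min, cell_size, stride=1):
    """Map a metric (x, y) box center to an integer (row, col) BEV cell.

    x_min / y_min are the lower bounds of the assumed detection range,
    cell_size is the input grid resolution in meters, and stride is the
    assumed downsampling factor of the BEV feature map.
    """
    step = cell_size * stride
    col = math.floor((x - x_min) / step)
    row = math.floor((y - y_min) / step)
    return row, col


def gather_bev_features(bev_map, indexes):
    """Look up one feature vector per recorded (row, col) index.

    bev_map[row][col] holds a feature vector. Indexes that fall outside
    the map yield None instead of raising an error.
    """
    n_rows = len(bev_map)
    n_cols = len(bev_map[0]) if bev_map else 0
    gathered = []
    for row, col in indexes:
        if 0 <= row < n_rows and 0 <= col < n_cols:
            gathered.append(bev_map[row][col])
        else:
            gathered.append(None)
    return gathered
```

Recording the `(row, col)` pair at single-frame inference time means a later stage can fetch the feature with a direct lookup, which is the sense in which the caption says values are "directly retrieved".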