DPFT: Dual Perspective Fusion Transformer for Camera-Radar-based Object Detection

Felix Fent; Andras Palffy; Holger Caesar

TL;DR

This work tackles robust, cost-effective 3D object detection for autonomous driving by fusing camera data with raw 4D radar cube data. It introduces the Dual Perspective Fusion Transformer (DPFT), which projects the radar cube onto two complementary planes (range-azimuth and azimuth-elevation) and fuses the resulting views with image features through deformable attention, without forcing all inputs into a single BEV feature space. DPFT achieves state-of-the-art performance on the K-Radar dataset, notably under severe weather, with an inference time of roughly 87 ms and graceful degradation under sensor failure. By combining high-dimensional radar data with dual-perspective querying, the approach offers a robust and scalable option for real-world autonomous driving perception.

Abstract

The perception of autonomous vehicles has to be efficient, robust, and cost-effective. However, cameras are not robust against severe weather conditions, lidar sensors are expensive, and the performance of radar-based perception is still inferior to the others. Camera-radar fusion methods have been proposed to address this issue, but these are constrained by the typical sparsity of radar point clouds and often designed for radars without elevation information. We propose a novel camera-radar fusion approach called Dual Perspective Fusion Transformer (DPFT), designed to overcome these limitations. Our method leverages lower-level radar data (the radar cube) instead of the processed point clouds to preserve as much information as possible and employs projections in both the camera and ground planes to effectively use radars with elevation information and simplify the fusion with camera data. As a result, DPFT has demonstrated state-of-the-art performance on the K-Radar dataset while showing remarkable robustness against adverse weather conditions and maintaining a low inference time. The code is made available as open-source software under https://github.com/TUMFTM/DPFT.
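
The dual-perspective projection mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the axis order (Doppler, range, azimuth, elevation), the toy cube shape, the function name `project_radar_cube`, and the use of a plain maximum as the reduction are all assumptions made for illustration; the released code may extract different or additional statistics.

```python
import numpy as np

def project_radar_cube(cube):
    """Reduce a 4D radar cube to two 2D views.

    Assumed (hypothetical) axis order: (doppler, range, azimuth, elevation).
    Returns a range-azimuth (RA) map and an azimuth-elevation (AE) map.
    """
    # Collapse the Doppler axis, keeping the strongest return per cell.
    power = cube.max(axis=0)   # shape: (range, azimuth, elevation)
    # Bird's-eye-like view: collapse elevation.
    ra = power.max(axis=2)     # shape: (range, azimuth)
    # Camera-like front view: collapse range.
    ae = power.max(axis=0)     # shape: (azimuth, elevation)
    return ra, ae

# Toy cube with made-up dimensions, only to show the resulting shapes.
rng = np.random.default_rng(0)
cube = rng.random((4, 8, 6, 5))
ra, ae = project_radar_cube(cube)
print(ra.shape, ae.shape)
```

The AE map shares its viewing direction with the camera image, while the RA map is perpendicular to it, which is what makes the two views complementary.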

Paper Structure (23 sections, 1 equation, 6 figures, 9 tables)

This paper contains 23 sections, 1 equation, 6 figures, 9 tables.

Introduction
Related Work
Camera-Radar Datasets
Camera-based 3D Object Detection
Radar-based 3D Object Detection
Camera-Radar Fusion for 3D Object Detection
Methodology
Data Preparation
Feature Extraction
Sensor Fusion
Object Detection
Model Training
Results
Robustness
Complexity
...and 8 more sections

Figures (6)

Figure 1: Illustration of the dual perspective fusion procedure. The 4D radar cube is projected onto a front and bird's eye view to create a parallel and perpendicular perspective to the camera image. This simplifies the camera-radar fusion and maintains the complementary sensor features. Object features are queried from these perspectives via an attention mechanism and used to regress 3D detections.
Figure 2: Overview of the DPFT model, showing the essential steps to fuse camera data with raw 4D radar data and retrieve objects from it. First ①, the 4D radar cube is projected onto the range-azimuth (RA) and azimuth-elevation (AE) planes. Second ②, the two radar perspectives and the camera data are fed through individual ResNet backbones to extract features. Third ③, Feature Pyramid Networks (FPN) align the dimensions of the multi-level feature maps. To fuse the features of the different perspectives, a set of query points is initialized in 3D space ④ and projected onto the different perspectives ⑤. The features hit by the projected points are then fused in the associated query points using deformable attention ⑥. A classification and regression head ⑦ retrieves bounding boxes from the queried features. Finally, the regressed bounding box positions are used as new query points ⑧ and their features are updated ⑨ in an iterative process that refines the bounding box proposals.
Figure 3: Exemplary results of the model performance under night, rain, snow, and backlight conditions. The ground truth is shown in blue and the model prediction in orange.
Figure 4: Performance loss due to the ablation of individual model components on the test data of the K-Radar dataset revision v2.0.
Figure 5: Visualization of the dataset's sensor miscalibration (left) and two failure cases of the model. One shows a missing detection of a crossing object (center) and the other shows false negatives for partially occluded objects (right). The ground truth is shown in blue and the model prediction in orange.
...and 1 more figure
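
Steps ④ and ⑤ of Figure 2 (initializing 3D query points and projecting them onto the radar perspectives) can be sketched as a Cartesian-to-polar conversion. This is a hedged illustration, not the paper's code: the sensor frame (x forward, y left, z up), the function name `project_queries`, and the sample points are assumptions, and the projection onto the camera image, which would additionally need calibration matrices, is left out.

```python
import numpy as np

def project_queries(points):
    """Map 3D query points to RA and AE plane coordinates.

    points: array of shape (N, 3) with assumed axes x forward, y left, z up.
    Returns (range, azimuth) pairs and (azimuth, elevation) pairs in
    metres and radians.
    """
    x, y, z = points[:, 0], points[:, 1], points[:, 2]
    dist = np.sqrt(x**2 + y**2 + z**2)
    azimuth = np.arctan2(y, x)
    # Guard against division by zero for a query at the sensor origin.
    elevation = np.arcsin(z / np.maximum(dist, 1e-9))
    ra = np.stack([dist, azimuth], axis=1)
    ae = np.stack([azimuth, elevation], axis=1)
    return ra, ae

# Hypothetical query points: straight ahead, to the left, ahead and above.
queries = np.array([[1.0, 0.0, 0.0],
                    [0.0, 2.0, 0.0],
                    [1.0, 0.0, 1.0]])
ra, ae = project_queries(queries)
print(ra)
print(ae)
```

In a full model, these continuous coordinates would be normalized to each feature map's extent and used as reference points for the deformable attention sampling in step ⑥.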
