DPFT: Dual Perspective Fusion Transformer for Camera-Radar-based Object Detection

Felix Fent, Andras Palffy, Holger Caesar

TL;DR

This work tackles robust, cost-effective 3D object detection for autonomous driving by fusing camera data with raw 4D radar cube data. It introduces the Dual Perspective Fusion Transformer (DPFT), which projects radar cubes into two complementary views (range-azimuth and azimuth-elevation) and fuses them with image features through deformable attention, without enforcing a single BEV feature space. DPFT demonstrates state-of-the-art performance on the K-Radar dataset, notably under severe weather, while achieving fast inference (approximately 87 ms) and graceful degradation under sensor failure. The approach broadens multimodal fusion by leveraging high-dimensional radar data and dual-perspective querying, offering a robust and scalable solution for real-world autonomous driving perception.

Abstract

The perception of autonomous vehicles has to be efficient, robust, and cost-effective. However, cameras are not robust against severe weather conditions, lidar sensors are expensive, and the performance of radar-based perception is still inferior to the others. Camera-radar fusion methods have been proposed to address this issue, but these are constrained by the typical sparsity of radar point clouds and often designed for radars without elevation information. We propose a novel camera-radar fusion approach called Dual Perspective Fusion Transformer (DPFT), designed to overcome these limitations. Our method leverages lower-level radar data (the radar cube) instead of the processed point clouds to preserve as much information as possible and employs projections in both the camera and ground planes to effectively use radars with elevation information and simplify the fusion with camera data. As a result, DPFT has demonstrated state-of-the-art performance on the K-Radar dataset while showing remarkable robustness against adverse weather conditions and maintaining a low inference time. The code is made available as open-source software under https://github.com/TUMFTM/DPFT.
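The dual projection of the radar cube described above can be illustrated with a minimal sketch. It assumes a cube with axis order (Doppler, range, elevation, azimuth) and collapses the unused axes with a plain maximum; both the axis order and the choice of reduction are assumptions made for illustration, and `project_radar_cube` is a hypothetical helper, not a function from the DPFT repository.

```python
from typing import Tuple

import numpy as np


def project_radar_cube(cube: np.ndarray) -> Tuple[np.ndarray, np.ndarray]:
    """Project a 4D radar cube onto two 2D perspectives.

    Assumed axis order (illustrative only): (doppler, range, elevation, azimuth).
    A simple max-reduction is used here; the paper may use other statistics.
    """
    if cube.ndim != 4:
        raise ValueError("expected a 4D radar cube")
    # Range-azimuth view (ground plane): collapse Doppler and elevation.
    range_azimuth = cube.max(axis=(0, 2))
    # Azimuth-elevation view (camera-like plane): collapse Doppler and range.
    # The resulting array is indexed as [elevation, azimuth].
    azimuth_elevation = cube.max(axis=(0, 1))
    return range_azimuth, azimuth_elevation


# Tiny synthetic cube with a single strong return.
cube = np.zeros((2, 3, 4, 5))
cube[1, 2, 3, 4] = 7.0
ra_view, ae_view = project_radar_cube(cube)
```

The single return appears in both views: at range bin 2 and azimuth bin 4 in the ground-plane view, and at elevation bin 3 and azimuth bin 4 in the front view. This shows how the two perspectives keep complementary parts of the same measurement.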

Paper Structure (23 sections, 1 equation, 6 figures, 9 tables)

Figures (6)

  • Figure 1: Illustration of the dual perspective fusion procedure. The 4D radar cube is projected onto a front and bird's eye view to create a parallel and perpendicular perspective to the camera image. This simplifies the camera-radar fusion and maintains the complementary sensor features. Object features are queried from these perspectives via an attention mechanism and used to regress 3D detections.
  • Figure 2: Overview of the DPFT model, showing the essential steps to fuse camera data with raw 4D radar data and retrieve objects from it. In step ①, the 4D radar cube is projected onto the range-azimuth (RA) and azimuth-elevation (AE) planes. In step ②, the two radar perspectives and the camera data are fed through individual ResNet backbones to extract features. In step ③, Feature Pyramid Networks (FPN) align the dimensions of the multi-level feature maps. To fuse the features of the different perspectives, a set of query points is initialized in 3D space in step ④ and projected onto the different perspectives in step ⑤. The features hit by the projection points are then fused in the associated query points using deformable attention in step ⑥. In step ⑦, a classification and regression head retrieves bounding boxes from the queried features. Finally, the regressed bounding box positions are used as new query points in step ⑧, and their features are updated in step ⑨ in an iterative process to refine the bounding box proposals.
  • Figure 3: Exemplary results of the model performance under night, rain, snow, and backlight conditions. The ground truth is shown in blue and the model prediction in orange.
  • Figure 4: Performance loss due to the ablation of individual model components on the test data of the K-Radar dataset revision v2.0.
  • Figure 5: Visualization of the dataset's sensor miscalibration (left) and two failure cases of the model. One shows a missing detection of a crossing object (center) and the other shows false negatives for partially occluded objects (right). The ground truth is shown in blue and the model prediction in orange.
  • ...and 1 more figures
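
The query projection described in the Figure 2 caption (3D query points projected onto each perspective) can be sketched as follows. This is a rough illustration under stated assumptions, not the authors' implementation: the sensor frame is taken as x forward, y left, z up; camera and radar are assumed co-located with identity extrinsics; and the pinhole intrinsics (`fx`, `fy`, `cx`, `cy`) are made-up values. `project_query` and `Projections` are hypothetical names.

```python
import math
from typing import NamedTuple, Optional, Tuple


class Projections(NamedTuple):
    range_azimuth: Tuple[float, float]       # (range in m, azimuth in rad)
    azimuth_elevation: Tuple[float, float]   # (azimuth in rad, elevation in rad)
    pixel: Optional[Tuple[float, float]]     # (u, v), or None if behind the camera


def project_query(x: float, y: float, z: float,
                  fx: float = 1000.0, fy: float = 1000.0,
                  cx: float = 640.0, cy: float = 360.0) -> Projections:
    """Project one 3D query point onto the RA, AE, and camera perspectives.

    Assumed frame (illustrative): x forward, y left, z up, sensors co-located.
    """
    rng = math.sqrt(x * x + y * y + z * z)
    azimuth = math.atan2(y, x)
    elevation = math.asin(z / rng) if rng > 0.0 else 0.0

    pixel = None
    if x > 0.0:  # only points in front of the camera have an image projection
        # Camera frame: right = -y, down = -z, depth = x.
        u = cx + fx * (-y) / x
        v = cy + fy * (-z) / x
        pixel = (u, v)

    return Projections((rng, azimuth), (azimuth, elevation), pixel)


# A query point 10 m straight ahead lands on the principal point of the image.
ahead = project_query(10.0, 0.0, 0.0)
# A query point behind the sensors has no camera projection.
behind = project_query(-5.0, 0.0, 0.0)
```

In a full model, the projected coordinates would typically be normalized and used as reference points for attention sampling in each feature map; that step is omitted here.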