PAN: Pillars-Attention-Based Network for 3D Object Detection

Ruan Bispo, Dane Mitrev, Letizia Mariotti, Clément Botty, Denver Humphrey, Anthony Scanlan, Ciarán Eising

TL;DR

PAN targets robust, real-time 3D object detection under adverse weather and lighting by fusing camera and radar data in the bird's-eye view (BEV). It introduces a Pillars-Attention-Based Backbone for radar, a Radar-assisted View Transformation, and a Multi-modal Deformable Cross-Attention fusion module, which together feed a CenterPoint head. On nuScenes, PAN with a ResNet-50 image backbone reaches 58.2 NDS at roughly 29–30 FPS, which the authors report as a new state of the art among camera-radar methods, and it performs strongly in rain and at night. Ablations and qualitative analyses support the contribution of the radar features and the transformer-based fusion to robust perception in autonomous driving.

Abstract

Camera-radar fusion offers a robust, low-cost alternative to camera-lidar fusion for real-time 3D object detection under adverse weather and lighting conditions. However, few works in the literature focus on this modality and, most importantly, few develop new architectures that exploit the advantages of the radar point cloud, such as accurate distance estimation and speed information. This work therefore presents a novel and efficient 3D object detection algorithm that uses cameras and radars in the bird's-eye view (BEV). Our algorithm exploits the advantages of radar before fusing the features into a detection head. We introduce a new backbone that maps the radar pillar features into an embedding dimension, and a self-attention mechanism allows the backbone to model the dependencies between radar points. To reduce inference time, we replace the FPN-based convolutional layers used in PointPillars-based architectures with a simplified convolutional layer. Our results show that, with this modification, our approach achieves a new state of the art in 3D object detection, reaching 58.2 NDS with ResNet-50, while also setting a new benchmark for inference time on the nuScenes dataset for the same category.

Paper Structure

This paper contains 18 sections, 5 equations, 6 figures, 6 tables, 1 algorithm.

Figures (6)

  • Figure 1: Overall proposed architecture. First, radar features are extracted by the PAN backbone while, in parallel, camera features are extracted by an image backbone. Second, these features are transformed into BEV using the radar features through the Radar-assisted View Transformation. Finally, the Multi-modal Feature Aggregation merges the radar and camera BEV features to feed the 3D detection head.
  • Figure 2: Summary of the PAN backbone, showing how this module can be used in this architecture or as a decoupled module for other perception tasks.
  • Figure 3: Self-attention branch diagram. The self-attention component exploits radar features that are sparse but accurate, such as speed and distance, to improve the detection metrics at low computational cost, owing to the sparse nature of the data.
  • Figure 4: Detailed Feature Enhancement Block diagram, where the input is the sparse pseudo-image $(H,W,C)$ and the output is the enhanced features. Given a sparse pseudo-image with radar features, a mask is applied to remove all empty pillars. An encoder then reduces the dimensionality of the features, and a self-attention branch is applied to capture the most highly correlated features. After a fully connected decoder, a simplified convolutional block refines the correlations between nearby features and adjusts the pseudo-image shape.
  • Figure 5: Visualization of the feature space. Panels (a), (b) and (c), (d) show the models' feature maps for two different scenes; the lighter areas represent the regions most used for detection.
  • ...and 1 more figures
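
The Feature Enhancement Block steps listed in the Figure 4 caption (mask empty pillars, encode, self-attention, fully connected decoder, simplified convolution) can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the single attention head, the residual connection, the treatment of all-zero pillars as empty, the layer sizes, the random weights, and every function and parameter name are hypothetical choices made for this example.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along one axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def conv3x3(x, kernel):
    """Zero-padded 3x3 convolution. x: (H, W, C_in), kernel: (3, 3, C_in, C_out)."""
    h, w, _ = x.shape
    padded = np.pad(x, ((1, 1), (1, 1), (0, 0)))
    out = np.zeros((h, w, kernel.shape[-1]))
    for dy in range(3):
        for dx in range(3):
            # Shifted view of the input times the weights for this kernel tap.
            out += padded[dy:dy + h, dx:dx + w] @ kernel[dy, dx]
    return out


def init_params(c_in=5, d_embed=32, c_out=16, seed=0):
    """Random weights standing in for learned ones (sizes are assumptions)."""
    rng = np.random.default_rng(seed)

    def weight(*shape):
        return rng.normal(0.0, 0.1, size=shape)

    return {
        "enc": weight(c_in, d_embed),
        "q": weight(d_embed, d_embed),
        "k": weight(d_embed, d_embed),
        "v": weight(d_embed, d_embed),
        "dec": weight(d_embed, c_out),
        "conv": weight(3, 3, c_out, c_out),
    }


def feature_enhancement_block(pseudo_image, params):
    """pseudo_image: sparse (H, W, C) array; all-zero pillars count as empty."""
    h, w, _ = pseudo_image.shape
    c_out = params["conv"].shape[-1]

    # 1) Mask: keep only the non-empty pillars as a short token sequence.
    mask = np.any(pseudo_image != 0, axis=-1)          # (H, W)
    tokens = pseudo_image[mask]                        # (N, C)
    if tokens.shape[0] == 0:
        return np.zeros((h, w, c_out))

    # 2) Encode: linear projection to the embedding dimension.
    x = tokens @ params["enc"]                         # (N, D)

    # 3) Self-attention over the N non-empty pillars (single head).
    q, k, v = x @ params["q"], x @ params["k"], x @ params["v"]
    attn = softmax(q @ k.T / np.sqrt(q.shape[-1]))     # (N, N)
    x = x + attn @ v                                   # residual: an assumption

    # 4) Decode: fully connected layer back to per-pillar features.
    y = x @ params["dec"]                              # (N, C_out)

    # 5) Scatter the tokens back onto the dense grid.
    dense = np.zeros((h, w, c_out))
    dense[mask] = y

    # 6) Simplified convolutional block: one 3x3 convolution plus ReLU.
    return np.maximum(conv3x3(dense, params["conv"]), 0.0)


# Toy input: an 8x8 grid with three non-empty radar pillars of 5 features each.
image = np.zeros((8, 8, 5))
image[1, 2] = [3.0, -1.0, 0.5, 2.0, 0.1]
image[4, 4] = [1.5, 0.2, -0.7, 0.0, 1.0]
image[6, 1] = [0.3, 2.2, 0.9, -1.1, 0.4]
out = feature_enhancement_block(image, init_params())
print(out.shape)  # (8, 8, 16)
```

Because attention runs only over the N non-empty pillars, the N x N attention matrix stays small for sparse radar data, which is consistent with the low computational cost noted in the Figure 3 caption.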