DRIFT: Dual-Representation Inter-Fusion Transformer for Automated Driving Perception with 4D Radar Point Clouds

Siqi Pei, Andras Palffy, Dariu M. Gavrila

TL;DR

This paper proposes DRIFT, a dual-path model that captures and fuses local and global context and outperforms the baselines on object detection and/or free-road estimation.

Abstract

4D radars, which provide 3D point cloud data along with Doppler velocity, are attractive components of modern automated driving systems due to their low cost and robustness under adverse weather conditions. However, they provide a significantly lower point cloud density than LiDAR sensors. This makes it important to exploit not only local but also global contextual scene information. This paper proposes DRIFT, a model that effectively captures and fuses both local and global contexts through a dual-path architecture. The model incorporates a point path to aggregate fine-grained local features and a pillar path to encode coarse-grained global features. These two parallel paths are intertwined via novel feature-sharing layers at multiple stages, enabling full utilization of both representations. DRIFT is evaluated on the widely used View-of-Delft (VoD) dataset and a proprietary internal dataset. It outperforms the baselines on the tasks of object detection and/or free road estimation. For example, DRIFT achieves a mean average precision (mAP) of 52.6% (compared to, say, 45.4% of CenterPoint) on the VoD dataset.

Paper Structure (17 sections, 5 figures, 5 tables)

Figures (5)

  • Figure 1: Example scene from the View-of-Delft dataset [apalffy2022] with radar and annotation data. The two insets on the left show the corresponding LiDAR and radar point clouds around the same pedestrian. Unlike in the LiDAR point cloud, the pedestrian is difficult to detect from the local region of the radar point cloud alone because of its sparsity. However, when global information (i.e. the pedestrian's position relative to the ego-vehicle and the scene, e.g. the drivable area marked in blue) is combined with local features such as shape and velocity, the pedestrian's presence becomes more apparent.
  • Figure 2: Comparison of dual-representation model structures. FS denotes a feature sharing block. (a) The sequential structure processes the point cloud through a point-based path first and then through a voxel-based path, e.g. [Leng2024]. (b) The parallel structure with fusion at the end processes both paths independently and merges their outputs before proceeding with a single-representation path, e.g. [Liu2019, JLiu2023]. (c) The proposed structure processes both paths in parallel and introduces feature sharing between them at each intermediate stage.
  • Figure 3: Model architecture. The top figure shows the overall architecture of DRIFT, which consists of a point path, a pillar path, and feature sharing blocks. The feature label after block $i$, $M_i \times C_i \ \text{in} \ H_i \times W_i$, denotes a sparse pillar tensor of shape $M_i \times C_i$ on a BEV grid of size $H_i \times W_i$, where $M_i$ is the number of non-empty pillars and $C_i$ is the number of channels. (a) Point transformer block. (b) Pillar transformer block. (c1) Feature sharing block with add or concatenation fusion. (c2) Feature sharing block with cross-attention fusion.
  • Figure 4: Visualization results of ground truth, CenterPoint, and DRIFT (proposed) on the VoD validation set. Blue/green/red boxes indicate car, pedestrian, and cyclist detections, respectively. All images are trimmed.
  • Figure 5: Visualization of free-road ground truth and prediction on the perciv-scenes-2 validation set in BEV. Yellow segments indicate the free-road region.
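
The Figure 3 caption describes a sparse pillar tensor of shape $M \times C$ on an $H \times W$ BEV grid and a feature sharing block with add fusion. The following is a minimal NumPy sketch of that idea, not the authors' implementation. The function names (`pillarize`, `share_features_add`), the mean pooling from points to pillars, and the use of the pre-update pillar features for the pillar-to-point direction are all assumptions made for illustration.

```python
import numpy as np


def pillarize(points_xy, x_range, y_range, pillar_size):
    """Assign each point to a non-empty BEV pillar (hypothetical helper).

    Returns the per-point pillar index in [0, M), the flat grid index of
    each non-empty pillar, and the grid size (H, W).
    """
    x_min, x_max = x_range
    y_min, y_max = y_range
    W = int(np.ceil((x_max - x_min) / pillar_size))
    H = int(np.ceil((y_max - y_min) / pillar_size))
    col = np.floor((points_xy[:, 0] - x_min) / pillar_size).astype(int)
    row = np.floor((points_xy[:, 1] - y_min) / pillar_size).astype(int)
    # Clamp points on or beyond the border into the outermost pillars.
    col = np.clip(col, 0, W - 1)
    row = np.clip(row, 0, H - 1)
    flat = row * W + col
    # Only non-empty pillars are kept, giving the sparse M-row layout.
    pillar_grid_ids, pillar_of_point = np.unique(flat, return_inverse=True)
    return pillar_of_point.reshape(-1), pillar_grid_ids, (H, W)


def share_features_add(point_feats, pillar_feats, pillar_of_point):
    """Add-fusion feature sharing between the two paths (assumed form).

    point_feats: (N, C), pillar_feats: (M, C), pillar_of_point: (N,).
    """
    M, C = pillar_feats.shape
    # Point -> pillar: mean-pool the point features that fall in each pillar.
    pooled = np.zeros((M, C), dtype=float)
    np.add.at(pooled, pillar_of_point, point_feats)
    counts = np.bincount(pillar_of_point, minlength=M)[:, None]
    pooled = pooled / np.maximum(counts, 1)
    new_pillar_feats = pillar_feats + pooled
    # Pillar -> point: each point gathers the feature of its own pillar.
    new_point_feats = point_feats + pillar_feats[pillar_of_point]
    return new_point_feats, new_pillar_feats


# Toy example: 3 points on a 3 x 4 grid, 2 channels.
points_xy = np.array([[0.5, 0.5], [0.6, 0.4], [3.5, 2.5]])
pillar_of_point, pillar_grid_ids, grid = pillarize(
    points_xy, x_range=(0.0, 4.0), y_range=(0.0, 3.0), pillar_size=1.0
)
point_feats = np.array([[1.0, 0.0], [3.0, 0.0], [0.0, 2.0]])
pillar_feats = np.array([[10.0, 10.0], [20.0, 20.0]])
new_point_feats, new_pillar_feats = share_features_add(
    point_feats, pillar_feats, pillar_of_point
)
```

In this toy case the first two points share one pillar, so $M = 2$ non-empty pillars out of $H \times W = 12$ grid cells. Concatenation fusion would replace the two additions with `np.concatenate(..., axis=1)` followed by a learned projection; cross-attention fusion (c2) is not sketched here.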