TDFANet: Encoding Sequential 4D Radar Point Clouds Using Trajectory-Guided Deformable Feature Aggregation for Place Recognition

Shouyi Lu, Guirong Zhuo, Haitao Wang, Quan Zhou, Huanyu Zhou, Renbo Huang, Minqing Huang, Lianqing Zheng, Qiang Shu

TL;DR

TDFANet tackles place recognition with sequential 4D radar by combining dynamic point removal, BEV feature encoding, ego-velocity–guided trajectory alignment, and a multi-scale spatio-temporal deformable transformer to aggregate features across time. The approach yields a compact global descriptor via GeM pooling and is trained with a metric-learning objective, achieving state-of-the-art performance on a real multi-sensor radar dataset. Core contributions include a trajectory-guided alignment strategy, a spatio-temporal pyramid deformable architecture, and the first end-to-end framework for sequential 4D radar place recognition, validated under dynamic and long-term appearance changes. This work advances radar-based localization robustness in challenging conditions and provides a dataset and codebase to spur further research.

Abstract

Place recognition is essential for achieving closed-loop or global positioning in autonomous vehicles and mobile robots. Despite recent advances in place recognition using 2D cameras or 3D LiDAR, it remains unclear how to use 4D radar for this task, even though the sensor is increasingly popular for its robustness against adverse weather and lighting conditions. Compared to LiDAR point clouds, radar data are drastically sparser, noisier, and of much lower resolution, which hampers their ability to represent scenes effectively and poses significant challenges for 4D radar-based place recognition. This work addresses these challenges by leveraging multi-modal information from sequential 4D radar scans and by extracting and aggregating spatio-temporal features. Our approach follows a principled pipeline that comprises (1) dynamic point removal and ego-velocity estimation from the velocity property, (2) bird's eye view (BEV) feature encoding on the refined point cloud, (3) feature alignment using the BEV feature map motion trajectory calculated from the ego-velocity, and (4) extraction and aggregation of multi-scale spatio-temporal features from the aligned BEV feature maps. Real-world experimental results validate the feasibility of the proposed method and demonstrate its robustness in handling dynamic environments. Source code is available.
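Step (1) of the pipeline can be illustrated with a minimal sketch. It assumes a classical formulation that may differ from the paper's own regression: a static point with unit line-of-sight direction d has measured Doppler velocity equal to -d · v_ego, so ego-velocity can be fit by least squares inside a RANSAC loop, and points that disagree with the fit are treated as dynamic. The function name, the sign convention, and the threshold are assumptions made for illustration, not values taken from the paper.

```python
import numpy as np

def estimate_ego_velocity(points, doppler, n_iters=100, threshold=0.2, seed=0):
    """Estimate sensor ego-velocity from one radar scan with RANSAC.

    points:  (N, 3) xyz positions in the sensor frame.
    doppler: (N,) measured radial velocities in m/s.
    Returns (v_ego, static_mask); static_mask marks the RANSAC inliers.
    Assumed convention: a static point measures doppler = -direction . v_ego.
    """
    rng = np.random.default_rng(seed)
    # Unit line-of-sight direction for every point.
    directions = points / np.linalg.norm(points, axis=1, keepdims=True)
    best_mask = np.zeros(len(points), dtype=bool)
    for _ in range(n_iters):
        # Three points are the minimum needed to solve for a 3D velocity.
        sample = rng.choice(len(points), size=3, replace=False)
        v, *_ = np.linalg.lstsq(directions[sample], -doppler[sample], rcond=None)
        # Residual of the static-point model for every point in the scan.
        mask = np.abs(directions @ v + doppler) < threshold
        if mask.sum() > best_mask.sum():
            best_mask = mask
    # Refit on all inliers of the best hypothesis.
    v_ego, *_ = np.linalg.lstsq(
        directions[best_mask], -doppler[best_mask], rcond=None
    )
    return v_ego, best_mask
```

Under these assumptions, `points[static_mask]` would be the refined point cloud passed on to BEV encoding, and `v_ego` would feed the trajectory computation in step (3).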

Paper Structure

This paper contains 20 sections, 10 equations, 4 figures, 2 tables.

Figures (4)

  • Figure 1: TDFANet Overview: Given sequential 4D radar point clouds, preprocessing first refines each point cloud using ego-velocity regression and RANSAC filtering. The refined point clouds are then encoded into BEV feature maps. Next, a trajectory-guided feature alignment method aligns the BEV feature maps from different time steps. A spatio-temporal pyramid deformable feature aggregation method then aggregates the aligned BEV feature maps. Finally, the global descriptor is generated using GeM pooling.
  • Figure 2: Spatio-Temporal Pyramid Deformable Feature Aggregation: A spatio-temporal feature pyramid is built using residual blocks, followed by the introduction of a spatio-temporal pyramid deformable transformer to aggregate these features. We illustrate the deformable feature aggregation process using the reference point $p_q$ as an example.
  • Figure 3: Overview of the dataset we collected. Trajectories in different colors represent data collected during different time periods. Examples show vegetation and vehicle changes caused by the long time span.
  • Figure 4: Challenging query frames and reference frames retrieved by SOTA methods. Even in the presence of dynamic objects in the scene or significant appearance changes, the proposed method accurately retrieves the top-1 reference frame, demonstrating its robustness and superiority in complex environments. Green means the retrieved reference is a true positive, while red denotes a false positive.
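The GeM pooling step named in Figure 1 can be sketched as follows. This is a generic generalized-mean pooling over a single feature map, not the authors' code; the function name, the (C, H, W) layout, and the default exponent p = 3 are assumptions chosen for illustration.

```python
import numpy as np

def gem_pool(feature_map, p=3.0, eps=1e-6):
    """Generalized-mean (GeM) pooling of a (C, H, W) feature map.

    Returns a (C,) vector with one value per channel.
    p = 1 gives average pooling; large p approaches max pooling.
    """
    # Clamp to a small positive value so the fractional power is defined.
    clamped = np.clip(feature_map, eps, None)
    return np.mean(clamped ** p, axis=(1, 2)) ** (1.0 / p)
```

The exponent controls how strongly the descriptor focuses on the strongest activations in each channel. In learned pipelines p is often a trainable parameter, and the pooled vector is commonly L2-normalized before retrieval; both choices are left out of this sketch.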