
EasyTrack: Efficient and Compact One-stream 3D Point Clouds Tracker

Baojie Fan, Wuyang Zhou, Kai Wang, Shijun Zhou, Fengyu Xu, Jiandong Tian

TL;DR

EasyTrack addresses the inefficiency of existing 3D single-object trackers by introducing a one-stream transformer that fuses template and search information into target-aware features and regresses the target pose in BEV space. It combines a dedicated 3D tracking pre-training stage with masked point modeling, a unified target-aware feature learning module, and an efficient BEV-based localization head; EasyTrack++ adds a center points interaction strategy to further suppress background noise. The approach achieves state-of-the-art results on KITTI, nuScenes, and Waymo while running in real time (52.6 FPS) with a compact footprint (~1.3M parameters). This work advances real-time, accurate 3D tracking in autonomous driving and lays groundwork for future multi-modal extensions that incorporate visual data.

Abstract

Most 3D single object trackers (SOT) for point clouds follow the two-stream, multi-stage 3D Siamese or motion tracking paradigms, which process the template and search-area point clouds with two parallel branches built on supervised point cloud backbones. In this work, going beyond typical 3D Siamese or motion tracking, we propose a neat and compact one-stream transformer 3D SOT paradigm from a novel perspective, termed EasyTrack, which consists of three special designs: 1) A 3D point cloud tracking feature pre-training module is developed to exploit masked autoencoding for learning 3D point cloud tracking representations. 2) A unified 3D tracking feature learning and fusion network is proposed to simultaneously learn target-aware 3D features and extensively capture mutual correlation through the flexible self-attention mechanism. 3) A target location network in the dense bird's eye view (BEV) feature space is constructed for target classification and regression. Moreover, we develop an enhanced version named EasyTrack++, which designs a center points interaction (CPI) strategy to reduce the target ambiguity caused by noisy background point cloud information. The proposed EasyTrack and EasyTrack++ set a new state of the art (18%, 40%, and 3% success gains) on KITTI, nuScenes, and Waymo while running at 52.6 FPS with few parameters (1.3M). The code will be available at https://github.com/KnightApple427/Easytrack.
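To make the one-stream idea in the abstract concrete, the following is a minimal sketch, not the authors' implementation. It assumes a single attention head, identity query/key/value projections, and no positional encoding or feed-forward layers; the names `self_attention` and `one_stream_fuse` and the toy 2-D token values are hypothetical. It only illustrates how concatenating template and search tokens lets one self-attention pass perform feature learning and template-search correlation together, instead of running two parallel branches followed by a separate fusion step.

```python
import math


def softmax(row):
    """Numerically stable softmax over a list of floats."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention.

    Assumption: Q, K, and V are the tokens themselves (identity projections),
    which keeps the sketch dependency-free.
    """
    d = len(tokens[0])
    scale = 1.0 / math.sqrt(d)
    weights = []
    for q in tokens:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in tokens]
        weights.append(softmax(scores))
    # Each output token is an attention-weighted average of all tokens.
    out = [
        [sum(w * v[j] for w, v in zip(row, tokens)) for j in range(d)]
        for row in weights
    ]
    return out, weights


def one_stream_fuse(template_tokens, search_tokens):
    """Concatenate both token sets and run one joint attention pass."""
    joint = template_tokens + search_tokens
    fused, weights = self_attention(joint)
    n_t = len(template_tokens)
    # Split the fused sequence back into template and search parts.
    return fused[:n_t], fused[n_t:], weights


# Toy tokens: search[0] resembles the template, search[1:] resemble background.
template = [[1.0, 0.0], [0.9, 0.1]]
search = [[1.0, 0.1], [0.0, 1.0], [0.1, 0.9]]
fused_t, fused_s, attn = one_stream_fuse(template, search)

# Attention mass that each search token places on the template tokens.
template_mass = [sum(row[: len(template)]) for row in attn[len(template):]]
print(template_mass)
```

In this toy setup the search token that resembles the template places more of its attention on the template tokens than the background-like tokens do, which is the intuition behind "target-aware" features. A real tracker would use learned projections, multiple heads, and stacked transformer blocks.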

Paper Structure (16 sections, 17 equations, 13 figures, 11 tables)

This paper contains 16 sections, 17 equations, 13 figures, 11 tables.

Figures (13)

  • Figure 1: Comparison of different 3D point cloud tracking frameworks. (a) The framework of typical Siamese trackers. (b) The framework of the proposed EasyTrack.
  • Figure 2: The 3D tracking pre-training strategy. We perform weight transfer on the transformer blocks in the target-aware network.
  • Figure 3: The network structure of EasyTrack. It is mainly composed of two parts: (1) A joint feature extraction and fusion network that learns and fuses template and search-area point features. (2) A target location network for classification and regression in the BEV feature space.
  • Figure 4: The detailed structure of the proposed local embedding network. Ball Query and MLP layers are applied to capture the local features of point clouds.
  • Figure 5: Illustration of the target-aware 3D feature learning. (a) The detailed structure of the transformer layer. (b) The attention maps generated by our target-aware point cloud feature learning network for the Car and Pedestrian categories in the KITTI dataset.
  • ...and 8 more figures