
SeqTrack3D: Exploring Sequence Information for Robust 3D Point Cloud Tracking

Yu Lin, Zhiheng Li, Yubo Cui, Zheng Fang

TL;DR

SeqTrack3D introduces a Seq2Seq framework for robust 3D single object tracking that jointly models sequences of point clouds and bounding boxes. It employs a Transformer-based encoder–decoder with a decoupled local-global feature encoding strategy to capture both spatial geometry and inter-frame motion, guided by historical boxes. The method achieves state-of-the-art results on NuScenes and Waymo and is more robust in sparse-point scenarios while maintaining efficient runtime. The work highlights the value of sequence-level supervision for continuous 3D tracking, and the authors state that the code will be released publicly for reproducibility.

Abstract

3D single object tracking (SOT) is an important and challenging task in autonomous driving and mobile robotics. Most existing methods track between two consecutive frames and ignore the target's motion patterns over a longer series of frames, which degrades performance in scenes with sparse points. To overcome this limitation, we introduce a Sequence-to-Sequence tracking paradigm and a tracker named SeqTrack3D that captures target motion across continuous frames. Previous methods primarily adopt one of three strategies: matching two consecutive point clouds, predicting relative motion, or using sequential point clouds to address feature degradation. In contrast, SeqTrack3D combines historical point clouds with bounding box sequences. By leveraging location priors from historical boxes, this approach tracks robustly even in scenes with sparse points. Extensive experiments on large-scale datasets show that SeqTrack3D achieves new state-of-the-art performance, improving by 6.00% on NuScenes and 14.13% on the Waymo dataset. The code will be made public at https://github.com/aron-lin/seqtrack3d.
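
To make the idea of a "location prior from historical boxes" concrete, the following is a minimal Python sketch. It is not the authors' method: SeqTrack3D learns the mapping with a Transformer, whereas this stand-in uses plain constant-velocity extrapolation. The `Box` fields and the function name `constant_velocity_prior` are hypothetical and chosen only for illustration.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Box:
    # Hypothetical box parameterization: center, size, heading in radians.
    x: float
    y: float
    z: float
    width: float
    length: float
    height: float
    yaw: float


def constant_velocity_prior(history: Sequence[Box]) -> Box:
    """Extrapolate the next box from the last two historical boxes.

    Illustrative stand-in only; the paper's tracker learns this from data.
    """
    if not history:
        raise ValueError("need at least one historical box")
    last = history[-1]
    if len(history) == 1:
        # No motion estimate is available, so reuse the last known box.
        return last
    prev = history[-2]
    # Shortest signed angle difference, so headings near +/- pi do not jump.
    dyaw = math.atan2(math.sin(last.yaw - prev.yaw),
                      math.cos(last.yaw - prev.yaw))
    return Box(
        x=last.x + (last.x - prev.x),
        y=last.y + (last.y - prev.y),
        z=last.z + (last.z - prev.z),
        width=last.width,
        length=last.length,
        height=last.height,
        yaw=last.yaw + dyaw,
    )
```

A prior of this kind depends only on past boxes, so it stays available when the current frame contains very few target points, which is the situation the abstract highlights.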

Paper Structure (15 sections, 11 equations, 4 figures, 5 tables)

This paper contains 15 sections, 11 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Comparison of tracking paradigms. (a) The Two-to-One paradigm uses two frames to locate the target through appearance matching or motion prediction. (b) The Sequence-to-One approach uses point clouds from multiple frames to integrate target information from different times. (c) Our Sequence-to-Sequence paradigm considers temporal cues of the target in both point and box sequences to handle sparse-point cases.
  • Figure 2: (a) Overview of the SeqTrack3D tracker. The encoder establishes spatial-temporal relations for the point sequence. Guided by the object sequence, which carries the target prior, the decoder generates box embeddings and uses them to predict sequential bounding boxes. (b) Details of the local-global encoder, which encodes the point sequence in a decoupled manner.
  • Figure 3: Tracking performance across varying numbers of template points in the first frame.
  • Figure 4: Visualization of tracking results on NuScenes. Complete tracking trajectories are projected onto the X-Y plane of the global coordinate system. Left: sparse-point cases; Right: dense cases; Top: Car category; Bottom: Pedestrian category.