SiamMo: Siamese Motion-Centric 3D Object Tracking

Yuxiang Yang, Yingqi Deng, Jing Zhang, Hongjie Gu, Zhekang Dong

TL;DR

This paper introduces SiamMo, a simple Siamese motion-centric approach to 3D single object tracking. It eliminates additional processes such as segmentation and box refinement, and it uses a Spatio-Temporal Feature Aggregation module to integrate Siamese features at multiple scales and capture motion information.

Abstract

Current 3D single object tracking methods primarily rely on the Siamese matching-based paradigm, which struggles with textureless and incomplete LiDAR point clouds. Conversely, the motion-centric paradigm avoids appearance matching, thus overcoming these issues. However, its complex multi-stage pipeline and the limited temporal modeling capability of a single-stream architecture constrain its potential. In this paper, we introduce SiamMo, a novel and simple Siamese motion-centric tracking approach. Unlike the traditional single-stream architecture, we employ Siamese feature extraction for motion-centric tracking. This decouples feature extraction from temporal fusion, significantly enhancing tracking performance. Additionally, we design a Spatio-Temporal Feature Aggregation module to integrate Siamese features at multiple scales, capturing motion information effectively. We also introduce a Box-aware Feature Encoding module to encode object size priors into motion estimation. SiamMo is a purely motion-centric tracker that eliminates the need for additional processes like segmentation and box refinement. Without bells and whistles, SiamMo not only surpasses state-of-the-art methods across multiple benchmarks but also demonstrates exceptional robustness in challenging scenarios. SiamMo sets a new record on the KITTI tracking benchmark with 90.1% precision while maintaining a high inference speed of 108 FPS. The code will be released at https://github.com/HDU-VRLab/SiamMo.
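
To make the motion-centric idea concrete: instead of matching appearance, a motion-centric tracker estimates the target's relative motion between two successive frames and applies it to the previous box. The sketch below illustrates only that final update step. It rests on assumptions the abstract does not state: the motion is parameterized as (dx, dy, dz, dyaw) expressed in the previous box's own frame, and the names `Box3D` and `apply_motion` are hypothetical. It is an illustration of the paradigm, not SiamMo's released code.

```python
import math
from dataclasses import dataclass


@dataclass
class Box3D:
    """Hypothetical 3D box: center (x, y, z), size (w, l, h), heading yaw in radians."""
    x: float
    y: float
    z: float
    w: float
    l: float
    h: float
    yaw: float


def apply_motion(prev: Box3D, dx: float, dy: float, dz: float, dyaw: float) -> Box3D:
    """Apply a relative motion, given in the previous box's frame, to the previous box.

    Assumption: (dx, dy) are rotated by the previous yaw into world coordinates;
    dz and dyaw are added directly.
    """
    c, s = math.cos(prev.yaw), math.sin(prev.yaw)
    x = prev.x + c * dx - s * dy
    y = prev.y + s * dx + c * dy
    z = prev.z + dz
    # Wrap the new heading into [-pi, pi).
    yaw = (prev.yaw + dyaw + math.pi) % (2.0 * math.pi) - math.pi
    # The box size is carried over unchanged from the previous frame.
    return Box3D(x, y, z, prev.w, prev.l, prev.h, yaw)


if __name__ == "__main__":
    prev_box = Box3D(x=0.0, y=0.0, z=0.0, w=1.6, l=4.0, h=1.5, yaw=math.pi / 2)
    # Move one metre "forward" along the box's own heading.
    print(apply_motion(prev_box, dx=1.0, dy=0.0, dz=0.0, dyaw=0.0))
```

The size is left untouched because, in single object tracking, the target's box is given in the first frame; only the pose changes from frame to frame.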

Paper Structure

This paper contains 31 sections, 5 equations, 7 figures, and 7 tables.

Figures (7)

  • Figure 1: Comparison of typical 3D object tracking methods. (a) A Siamese matching-based tracker that relies on appearance matching. (b) A motion-centric tracker that requires multiple stages of segmentation and box refinement. (c) Our Siamese motion-centric tracker, which adopts a Siamese architecture to perform motion-centric tracking in a simple, end-to-end, single-stage manner.
  • Figure 2: Architecture of SiamMo. SiamMo comprises three main blocks. Siamese Feature Extraction (SFE) encodes two successive frames into multi-scale BEV feature maps. Spatio-Temporal Feature Aggregation (STFA) then integrates the BEV feature maps at multiple scales. Finally, Box-aware Feature Encoding (BFE) injects the box priors into motion features for prediction.
  • Figure 3: Diagram of Spatio-Temporal Feature Aggregation.
  • Figure 4: Visualization of tracking results by our SiamMo and state-of-the-art methods.
  • Figure 5: Robustness to sparsity. Each interval [a, b) gives the number of points on the car in the first frame. Success is the evaluation metric, and the number line ranges from 0.2 to 0.9.
  • ...and 2 more figures
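
The sparsity analysis in Figure 5 groups results by half-open intervals [a, b) of first-frame point counts. A minimal sketch of that grouping is given below, assuming per-sequence Success scores are already available. The function name `bucket_success`, the bin edges, and the sample values are all hypothetical and are not taken from the paper.

```python
from bisect import bisect_right
from typing import List, Optional, Sequence, Tuple


def bucket_success(
    tracklets: Sequence[Tuple[int, float]],
    edges: Sequence[int],
) -> List[Optional[float]]:
    """Average Success per point-count interval.

    tracklets: (number of points in the first frame, Success score) pairs.
    edges: ascending bin edges; bucket i covers [edges[i], edges[i + 1]).
    Returns one mean per bucket, or None for an empty bucket.
    """
    n_buckets = len(edges) - 1
    sums = [0.0] * n_buckets
    counts = [0] * n_buckets
    for n_points, success in tracklets:
        # bisect_right makes the lower edge inclusive and the upper edge exclusive.
        i = bisect_right(edges, n_points) - 1
        if 0 <= i < n_buckets:
            sums[i] += success
            counts[i] += 1
    return [s / c if c else None for s, c in zip(sums, counts)]


if __name__ == "__main__":
    # Made-up values, for illustration only.
    sample = [(5, 0.4), (19, 0.6), (20, 0.8), (40, 0.9)]
    print(bucket_success(sample, edges=[0, 20, 40]))
```

A count equal to the last edge falls outside every interval and is skipped, which matches the half-open [a, b) notation used in the caption.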