Modeling Continuous Motion for 3D Point Cloud Object Tracking

Zhipeng Luo, Gongjie Zhang, Changqing Zhou, Zhonghua Wu, Qingyi Tao, Lewei Lu, Shijian Lu

TL;DR

This paper proposes a contrastive sequence enhancement strategy that uses ground truth tracklets to augment training sequences and promote discrimination against false positives in a contrastive manner. The resulting tracker outperforms the state-of-the-art method by significant margins on multiple benchmarks.

Abstract

The task of 3D single object tracking (SOT) with LiDAR point clouds is crucial for various applications, such as autonomous driving and robotics. However, existing approaches have primarily relied on appearance matching or motion modeling within only two successive frames, thereby overlooking the long-range continuous motion property of objects in 3D space. To address this issue, this paper presents a novel approach that views each tracklet as a continuous stream: at each timestamp, only the current frame is fed into the network to interact with multi-frame historical features stored in a memory bank, enabling efficient exploitation of sequential information. To achieve effective cross-frame message passing, a hybrid attention mechanism is designed to account for both long-range relation modeling and local geometric feature extraction. Furthermore, to enhance the utilization of multi-frame features for robust tracking, a contrastive sequence enhancement strategy is proposed, which uses ground truth tracklets to augment training sequences and promote discrimination against false positives in a contrastive manner. Extensive experiments demonstrate that the proposed method outperforms the state-of-the-art method by significant margins on multiple benchmarks.
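
The streaming idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `MemoryBank`, `track_stream`, and the `encode` and `predict` callables are hypothetical names introduced here, plain Python lists stand in for feature tensors, and a fixed-length first-in-first-out buffer is assumed as the memory policy. The point it shows is that each frame is encoded exactly once, while history is read back from the bank rather than recomputed.

```python
from collections import deque
from typing import Callable, Deque, List, Sequence, Tuple

Feature = List[float]                    # stand-in for a per-frame feature tensor
Box = Tuple[float, float, float, float]  # hypothetical (x, y, z, yaw) box
Entry = Tuple[Feature, Box]


class MemoryBank:
    """Fixed-size FIFO store of past features and box predictions (assumed policy)."""

    def __init__(self, max_frames: int) -> None:
        if max_frames < 1:
            raise ValueError("max_frames must be at least 1")
        # deque(maxlen=...) silently drops the oldest entry once full.
        self._entries: Deque[Entry] = deque(maxlen=max_frames)

    def read(self) -> List[Entry]:
        """Return stored entries, oldest first."""
        return list(self._entries)

    def write(self, feature: Sequence[float], box: Box) -> None:
        """Store the current frame's feature and predicted box."""
        self._entries.append((list(feature), box))

    def __len__(self) -> int:
        return len(self._entries)


def track_stream(
    frames: Sequence[Sequence[float]],
    bank: MemoryBank,
    encode: Callable[[Sequence[float]], Feature],
    predict: Callable[[Feature, List[Entry]], Box],
) -> List[Box]:
    """Process a tracklet frame by frame, encoding only the current frame."""
    boxes: List[Box] = []
    for frame in frames:
        feature = encode(frame)          # only the current frame goes through the encoder
        history = bank.read()            # historical features are fetched, not recomputed
        box = predict(feature, history)  # placeholder for cross-frame relation modeling
        bank.write(feature, box)         # the current frame becomes history for the next step
        boxes.append(box)
    return boxes
```

With `max_frames=1` the loop reduces to a two-frame setting in which the prediction sees a single historical frame; larger values expose a longer history to `predict` at no extra encoding cost.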

Paper Structure

This paper contains 17 sections, 8 equations, 8 figures, and 5 tables.

Figures (8)

  • Figure 1: Comparison of 3D single object tracking paradigms. (a) The matching-based paradigm extracts features from a cropped template and a search region, and object localization is performed via appearance matching. (b) The motion-centric paradigm takes concatenated point cloud frames as input and estimates relative motion based on segmented objects. (c) Our proposed StreamTrack only takes the current frame as input, while historical features are fetched from a memory bank, allowing for the exploitation of multi-frame continuous motion for robust tracking.
  • Figure 2: Overall architecture of StreamTrack. StreamTrack consists of three modules: memory-assisted feature extraction, spatial-temporal relation modeling, and query-based prediction. At timestamp $t$, StreamTrack only takes as input the current frame $P_t$, while historical features and box predictions are fetched from a memory bank for efficient computation. A Transformer encoder-decoder architecture is adopted for cross-frame message passing and the generation of tracking predictions.
  • Figure 3: Architecture of the proposed hybrid attention. A local spatial attention module is introduced to work in parallel with global spatial-temporal attention to account for both local feature extraction and long-term relation modeling to achieve more effective cross-frame message passing.
  • Figure 4: Illustration of contrastive sequence enhancement. Blue points denote the original target object in a tracking sequence, and orange points represent the added tracklet which serves as a negative sample. An auxiliary contrastive loss is applied to further promote discrimination.
  • Figure 5: Visualization of tracking predictions on a Pedestrian sequence in which distractors exist. When $n=1$, StreamTrack relies on only one historical frame, which is similar to the existing motion-centric paradigm (Zheng et al., 2022). The comparison demonstrates that exploiting multi-frame continuous motion effectively improves tracking robustness.
  • ...and 3 more figures
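
The captions above do not give the form of the auxiliary contrastive loss mentioned in Figure 4. As an illustration only, the sketch below uses a generic InfoNCE-style loss with cosine similarity, which is one common way to pull a query toward a positive sample and push it away from negatives. The function names, the temperature value, and the choice of InfoNCE itself are assumptions, not the paper's formulation; here the positive would correspond to a feature of the true target and the negatives to features of an added tracklet.

```python
import math
from typing import Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity between two non-zero vectors of equal length."""
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for zero vectors")
    return sum(x * y for x, y in zip(a, b)) / (norm_a * norm_b)


def info_nce(
    query: Sequence[float],
    positive: Sequence[float],
    negatives: Sequence[Sequence[float]],
    temperature: float = 0.1,  # assumed value, not taken from the paper
) -> float:
    """Generic InfoNCE loss: -log softmax score of the positive among all candidates."""
    logits = [cosine(query, positive) / temperature]
    logits += [cosine(query, neg) / temperature for neg in negatives]
    # Log-sum-exp with the maximum subtracted for numerical stability.
    peak = max(logits)
    log_denominator = peak + math.log(sum(math.exp(v - peak) for v in logits))
    return log_denominator - logits[0]
```

The loss is near zero when the query is far more similar to the positive than to any negative, and it grows as a negative becomes the better match, which is the behavior a discriminative auxiliary term needs.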