Spatio-Temporal Bi-directional Cross-frame Memory for Distractor Filtering Point Cloud Single Object Tracking

Shaoyu Sun, Chunyang Wang, Xuelian Liu, Chunhao Shi, Yueyang Ding, Guan Xi

TL;DR

A spatio-temporal bi-directional cross-frame distractor-filtering tracker that improves the efficiency and precision of object localization, reduces tracking errors caused by distractors, and is reported to surpass current state-of-the-art methods.

Abstract

3D single object tracking in LiDAR point clouds is a pivotal task in computer vision, with direct implications for autonomous driving and robotics. However, existing methods, which either depend solely on appearance matching via Siamese networks or use motion information from successive frames, encounter significant challenges: nearby similar objects or occlusions can cause tracker drift. To mitigate these challenges, we design a spatio-temporal bi-directional cross-frame distractor filtering tracker, named STMD-Tracker. First, we build a 4D multi-frame spatio-temporal graph convolution backbone. This design separates KNN graph spatial embedding and incorporates 1D temporal convolution, capturing temporal fluctuations and spatio-temporal information. Second, we devise a bi-directional cross-frame memory procedure that integrates future and synthetic past frame memory to enhance the current memory, thereby improving the accuracy of iteration-based tracking. This iterative memory update mechanism allows our tracker to dynamically compensate for missing information in the current frame, reducing tracker drift. Lastly, we construct spatially reliable Gaussian masks on the fused features to eliminate distractor points. This is supplemented by an object-aware sampling strategy, which improves the efficiency and precision of object localization and thereby reduces tracking errors caused by distractors. Extensive experiments on the KITTI, nuScenes and Waymo datasets demonstrate that our approach significantly surpasses current state-of-the-art methods.
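
The last step of the abstract, filtering distractor points with a Gaussian mask, can be illustrated with a small sketch. This is an illustration under stated assumptions and not the authors' implementation: the abstract does not give the form of the mask, so the isotropic Gaussian, the estimated target centre, the `sigma` and `threshold` parameters, and both function names are hypothetical. The sketch also works on raw point coordinates, whereas the paper builds its masks on fused features.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def gaussian_mask(points: Sequence[Point], center: Point, sigma: float) -> List[float]:
    """Return one weight in (0, 1] per point: exp(-d^2 / (2 * sigma^2)).

    d is the Euclidean distance from the point to the (assumed) target centre.
    """
    if sigma <= 0:
        raise ValueError("sigma must be positive")
    weights = []
    for p in points:
        d2 = sum((a - b) ** 2 for a, b in zip(p, center))
        weights.append(math.exp(-d2 / (2.0 * sigma ** 2)))
    return weights


def filter_distractors(
    points: Sequence[Point], center: Point, sigma: float, threshold: float
) -> List[Point]:
    """Keep only the points whose mask weight reaches the threshold."""
    weights = gaussian_mask(points, center, sigma)
    return [p for p, w in zip(points, weights) if w >= threshold]


if __name__ == "__main__":
    cloud = [(0.0, 0.0, 0.0), (0.5, 0.0, 0.0), (5.0, 0.0, 0.0)]
    # The point 5 m away gets a near-zero weight and is dropped.
    print(filter_distractors(cloud, center=(0.0, 0.0, 0.0), sigma=1.0, threshold=0.5))
```

A hard threshold is only one option; the weights could equally be used as soft multipliers on per-point features, which keeps the operation differentiable.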

Paper Structure (15 sections, 14 equations, 6 figures, 7 tables)

Figures (6)

  • Figure 1: Comparison of 3D single object tracker patterns. (a) Matching-based trackers use appearance matching to locate the target. (b) Motion-based methods use segmentation to capture motion patterns and predict the subsequent movement. (c) Our approach uses a multi-frame spatio-temporal backbone with temporal convolution and cross-frame memory to compensate for lost target information, and a Gaussian mask to filter out distractors for accurate localization.
  • Figure 2: The overall architecture of STMD-Tracker. It comprises three main blocks, processing N frames through a 4D spatio-temporal graph convolution backbone. Initially, we utilize a 3D graph convolution backbone to embed spatial features. Subsequently, we apply 1D temporal convolution to embed multi-frame spatial features. These features are then fed into a bi-directional cross-frame memory module, which compensates for missing information in the current frame, thereby mitigating tracker drift. Finally, we employ a distractor filtering RPN to eliminate false negative proposals, enhancing the accuracy of the tracking process.
  • Figure 3: Temporal Convolution on Multi-Frame Spatial Feature Maps. Each frame constructs a KNN graph to aggregate neighboring features for representing node characteristics. After embedding the spatial features of each frame, a 1D temporal convolution is applied along the temporal dimension. Purple and blue denote the replication of the first and last frames, respectively, serving as padding for the convolution. With a kernel size of 3 and stride of 1, the convolution of 8 frames captures both short-term motion and long-term tracking information.
  • Figure 4: Bi-directional cross-frame memory module. In Case 1, the target in frame t has lost information due to occlusion; feeding the subsequent frame (t+1) and the previous frame (t-1) into the bi-directional cross-frame memory module compensates for the missing target information. In Case 2, when the target in frame t is unoccluded, the same bi-directional memory mechanism takes frames t+1 and t-1 to augment the target's correlation feature. This mechanism mitigates tracker drift and improves tracking performance.
  • Figure 5: Gaussian mask filtering of distractor points. The red points, which represent distractor points, can be filtered out using a Gaussian mask. The incorrect prediction is marked with a blank box, and the correct prediction is marked with a green box.
  • ...and 1 more figure
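
The temporal convolution described in the Figure 3 caption (kernel size 3, stride 1, first and last frames replicated as padding, 8 input frames) can be sketched as follows. This is a minimal pure-Python illustration, not the authors' code: the function name `temporal_conv1d` is hypothetical, and for simplicity a single kernel is shared across all feature channels, whereas a learned layer would normally also mix channels.

```python
from typing import List, Sequence


def temporal_conv1d(
    frames: Sequence[Sequence[float]], kernel: Sequence[float]
) -> List[List[float]]:
    """Stride-1 convolution along the time axis with replicate padding.

    frames: T per-frame feature vectors, all of the same length C.
    kernel: odd-length temporal kernel, shared across channels (assumption).
    Returns T output vectors, so the sequence length is preserved.
    """
    if not frames:
        raise ValueError("at least one frame is required")
    if len(kernel) % 2 == 0:
        raise ValueError("kernel size must be odd")
    pad = len(kernel) // 2
    # Replicate the first and last frames, as in the Figure 3 caption.
    padded = [frames[0]] * pad + list(frames) + [frames[-1]] * pad
    channels = len(frames[0])
    out = []
    for t in range(len(frames)):
        window = padded[t : t + len(kernel)]
        out.append(
            [sum(w * f[c] for w, f in zip(kernel, window)) for c in range(channels)]
        )
    return out


if __name__ == "__main__":
    # 8 frames with one scalar feature each: 0, 1, ..., 7.
    frames = [[float(t)] for t in range(8)]
    for t, y in enumerate(temporal_conv1d(frames, [1.0, 1.0, 1.0])):
        print(t, y)
```

With a kernel of size 3, each output frame sees only its immediate neighbours; stacking several such layers widens the temporal receptive field by two frames per layer.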