3D Single-object Tracking in Point Clouds with High Temporal Variation

Qiao Wu, Kun Sun, Pei An, Mathieu Salzmann, Yanning Zhang, Jiaqi Yang

TL;DR

This work tackles 3D single-object tracking (3D SOT) under high temporal variation (HV) by introducing HVTrack, a transformer-based framework augmented with a Relative-Pose-Aware Memory (RPM), Base-Expansion Feature Cross-Attention (BEA), and Contextual Point Guided Self-Attention (CPA). A KITTI-HV dataset is built by sampling KITTI at different frame intervals to simulate HV conditions, enabling evaluation beyond standard smooth-variation benchmarks. HVTrack shows strong gains over state-of-the-art trackers, notably surpassing CXTrack on KITTI-HV and achieving leading performance on Waymo across HV settings, with ablations confirming the value of each module. The approach offers robust tracking in dynamic, cluttered environments and lays groundwork for HV-aware 3D SOT, with potential for further optimization and broader backbone support.

Abstract

The high temporal variation of the point clouds is the key challenge of 3D single-object tracking (3D SOT). Existing approaches rely on the assumption that the shape variation of the point clouds and the motion of the objects across neighboring frames are smooth, and therefore fail to cope with high temporal variation data. In this paper, we present a novel framework for 3D SOT in point clouds with high temporal variation, called HVTrack. HVTrack introduces three novel components to tackle the challenges in the high temporal variation scenario: 1) a Relative-Pose-Aware Memory module to handle temporal point cloud shape variations; 2) a Base-Expansion Feature Cross-Attention module to deal with similar object distractions in expanded search areas; 3) a Contextual Point Guided Self-Attention module for suppressing heavy background noise. We construct a dataset with high temporal variation (KITTI-HV) by setting different frame intervals for sampling in the KITTI dataset. On KITTI-HV with a frame interval of 5, our HVTrack surpasses the state-of-the-art tracker CXTrack by 11.3%/15.7% in Success/Precision.
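The frame-interval sampling described in the abstract can be pictured with a minimal sketch. The function name `subsample_sequence` and the toy tracklet are hypothetical, and details such as the starting offset or how trailing frames are handled are assumptions, since the abstract does not specify them.

```python
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def subsample_sequence(frames: Sequence[T], interval: int, start: int = 0) -> List[T]:
    """Keep every `interval`-th frame, beginning at index `start`.

    Assumption: frames that fall between sampled indices are simply dropped.
    """
    if interval < 1:
        raise ValueError("interval must be a positive integer")
    return list(frames[start::interval])


# Toy tracklet: integer frame indices stand in for (point cloud, box) pairs.
tracklet = list(range(12))
for interval in (1, 5, 10):
    # interval=1 reproduces the smooth case; larger intervals widen the gap
    # between consecutive observations of the target.
    print(interval, subsample_sequence(tracklet, interval))
```

With a larger interval, consecutive sampled frames are further apart in time, so both the target motion and the change in its observed shape between neighboring samples grow.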

Paper Structure

This paper contains 18 sections, 12 equations, 9 figures, 17 tables, and 1 algorithm.

Figures (9)

  • Figure 1: Feature correlation in 3D SOT. (a) Feature correlation in the smooth case (1-frame interval). Correlating the features is relatively trivial, as the target undergoes only small shape variations and the observation angles are consistent across the three frames. (b-c) Feature correlation in high temporal variation cases (10-frame interval). The pose relative to the camera changes rapidly, so correlating the features using historical information is highly challenging (b). We encode the historical observation angles $\alpha$ into the features to guide the variation of the relative pose to the camera (c).
  • Figure 2: Comparison of HVTrack with the SOTAs [qi_p2b_2020, zheng_box-aware_2021, zheng_beyond_2022, xu2023cxtrack] on 'Car' from KITTI-HV (KITTI [geiger_are_2012] with different frame intervals; see the experiments section).
  • Figure 3: HVTrack framework. We first utilize a backbone to extract the local embedding features of the search area. Then, we construct $L$ transformer layers to fuse spatio-temporal information. For each transformer layer, (i) we apply three memory bank features in the Relative-Pose-Aware Memory module to generate temporal template information; (ii) we employ the Base-Expansion Feature Cross-Attention to correlate the template and search area by leveraging hybrid scale spatial context-aware features; (iii) we introduce a Contextual Point Guided Self-Attention to suppress unimportant noise. After each layer, we update the layer features memory bank using the layer input. Finally, we apply an RPN to regress the 3D bounding box, and update the mask and observation angle memory banks.
  • Figure 4: (a) Base-Expansion Feature Cross-Attention (BEA). The $H$ heads in the multi-head attention (MHA) are split to process hybrid scale features. For the base scale branch, we feed the local features directly into the MHA. For the expansion scale branch, we apply an EdgeConv [wang2019dynamic] to expand the receptive field of each point and extract more abstract features before the MHA. BEA captures spatial context-aware information at a modest extra computational cost. (b) Contextual Point Guided Self-Attention (CPA). We determine the importance of each point from both the base and expansion scale attention maps. Then, we aggregate all the points into $U$ clusters (contextual points) according to their importance and project the clusters to K and V. We assign fewer contextual points to low-importance points, and more to high-importance points. CPA not only suppresses the noise but also reduces the computational cost of the attention.
  • Figure 5: The attention maps of 'Van' in CPA.
  • ...and 4 more figures
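
The CPA caption (Figure 4b) describes aggregating points into $U$ contextual points according to importance, with fewer contextual points spent on low-importance points. The following is only an illustrative sketch of that idea, not the authors' implementation: the geometric group sizing, the mean pooling, and the names `group_sizes`, `contextual_points`, and `growth` are all assumptions, and the importance scores are taken as given rather than derived from attention maps.

```python
from typing import List


def group_sizes(num_points: int, num_clusters: int, growth: float = 2.0) -> List[int]:
    """Split num_points into num_clusters group sizes that grow geometrically.

    Assumption: small groups come first and are used for the most important
    points, so those points receive more contextual points per point.
    """
    if not 1 <= num_clusters <= num_points:
        raise ValueError("need 1 <= num_clusters <= num_points")
    weights = [growth ** i for i in range(num_clusters)]
    total = sum(weights)
    # Give every group one point, then share the remainder by weight.
    spare = num_points - num_clusters
    sizes = [1 + int(spare * w / total) for w in weights]
    # Rounding down can leave points unassigned; add them to the last group.
    sizes[-1] += num_points - sum(sizes)
    return sizes


def contextual_points(
    features: List[List[float]],
    importance: List[float],
    num_clusters: int,
    growth: float = 2.0,
) -> List[List[float]]:
    """Mean-pool importance-sorted points into num_clusters contextual points."""
    # Indices of points from most to least important.
    order = sorted(range(len(features)), key=lambda i: importance[i], reverse=True)
    sizes = group_sizes(len(features), num_clusters, growth)
    clusters = []
    start = 0
    for size in sizes:
        members = [features[i] for i in order[start:start + size]]
        dim = len(members[0])
        # Average the member features channel by channel.
        clusters.append([sum(m[d] for m in members) / size for d in range(dim)])
        start += size
    return clusters


# Toy input: 8 points with 1-D features equal to their index.
feats = [[float(i)] for i in range(8)]
scores = [0.9, 0.1, 0.5, 0.3, 0.8, 0.2, 0.7, 0.4]
pooled = contextual_points(feats, scores, num_clusters=3)
print(group_sizes(8, 3), pooled)
```

In a self-attention layer, the $U$ pooled vectors would then be projected to K and V while the $N$ original points supply Q, so the attention map has $N \times U$ entries instead of $N \times N$. This is the source of the cost reduction the caption mentions.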