P2P: Part-to-Part Motion Cues Guide a Strong Tracking Framework for LiDAR Point Clouds

Jiahao Nie, Fei Xie, Sifan Zhou, Xueyi Zhou, Dong-Kyu Chae, Zhiwei He

TL;DR

The paper tackles 3D single object tracking on LiDAR by moving beyond appearance-based methods to a motion-centric paradigm. It introduces P2P, a part-to-part motion modeling framework that directly infers relative target motion from consecutive frames, using both point-based (P2P-point) and voxel-based (P2P-voxel) representations. By fusing corresponding parts across two-frame spatial structures, P2P produces fine-grained motion cues and achieves state-of-the-art results on KITTI, NuScenes, and Waymo Open Dataset with high efficiency (up to ~107 FPS). The work demonstrates strong generalization and robustness, provides extensive ablations, and positions P2P as a solid baseline for future LiDAR-based 3D SOT research.

Abstract

3D single object tracking (SOT) methods based on appearance matching have long suffered from insufficient appearance information caused by incomplete, textureless and semantically deficient LiDAR point clouds. While the motion paradigm exploits motion cues instead of appearance matching for tracking, it requires complex multi-stage processing and a segmentation module. In this paper, we first provide in-depth explorations of the motion paradigm, which show that (i) it is feasible to directly infer target relative motion from point clouds across consecutive frames; (ii) fine-grained information comparison between consecutive point clouds facilitates target motion modeling. We therefore propose to perform part-to-part motion modeling for consecutive point clouds and introduce a novel tracking framework, termed P2P. The framework fuses the information of each pair of corresponding parts between consecutive point clouds, effectively exploring detailed information changes and thus modeling accurate target-related motion cues. Following this framework, we present P2P-point and P2P-voxel models, which incorporate implicit and explicit part-to-part motion modeling through point- and voxel-based representations, respectively. Without bells and whistles, P2P-voxel sets a new state-of-the-art performance (~89%, 72% and 63% precision on KITTI, NuScenes and Waymo Open Dataset, respectively). Moreover, under the same point-based representation, P2P-point outperforms the previous motion tracker M$^2$Track by 3.3% and 6.7% on KITTI and NuScenes, respectively, while running at a considerably high speed of 107 FPS on a single RTX3090 GPU. The source code and pre-trained models are available at https://github.com/haooozi/P2P.

Paper Structure

This paper contains 20 sections, 11 equations, 13 figures, 12 tables.

Figures (13)

  • Figure 1: Comparison with state-of-the-art methods. We visualize mean success performance across all categories on the KITTI dataset with respect to floating-point operations (FLOPs). P2P-point and P2P-voxel denote the proposed tracking models with point-based and voxel-based representations, respectively.
  • Figure 2: Preliminary investigation on the KITTI dataset. We conduct a series of experiments, (a), (b) and (c), each denoted by a circle of a different color.
  • Figure 3: Framework of the existing motion paradigm, M$^2$Track. M$^2$Track is the first motion tracker for 3D single object tracking; it relies on auxiliary components such as a segmentation module and a motion refinement module.
  • Figure 4: Framework of the proposed P2P. P2P is an end-to-end framework for 3D single object tracking without any auxiliary components. It consists only of a feature extractor, a part-to-part motion modeling module and a motion prediction head.
  • Figure 5: Illustration of the part-to-part motion modeling module. It consists of a part-to-part fusion operation and $L$ motion modeling layers, with each layer dedicated to feature learning at both spatial and channel levels.
  • ...and 8 more figures
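
The explicit, voxel-based flavor of part-to-part fusion described in the abstract and in the Figure 5 caption can be illustrated with a toy sketch. This is not the authors' implementation: the function names (`voxelize_occupancy`, `part_to_part_fuse`), the binary occupancy feature, the grid parameters and the channel-stacking fusion are all assumptions made here for illustration. The idea shown is only that both frames are voxelized on the same grid, so that each voxel ("part") of the previous frame is paired with the voxel at the same location in the current frame.

```python
import numpy as np

def voxelize_occupancy(points, grid_min, voxel_size, grid_shape):
    """Binary occupancy grid for an (N, 3) point array (hypothetical stand-in for learned voxel features)."""
    grid = np.zeros(grid_shape, dtype=np.float32)
    # Map each point to an integer voxel index on the shared grid.
    idx = np.floor((points - grid_min) / voxel_size).astype(np.int64)
    # Drop points that fall outside the grid.
    inside = np.all((idx >= 0) & (idx < np.array(grid_shape)), axis=1)
    idx = idx[inside]
    grid[idx[:, 0], idx[:, 1], idx[:, 2]] = 1.0
    return grid

def part_to_part_fuse(prev_grid, curr_grid):
    """Stack two frames along a new channel axis so every voxel is paired with its counterpart (assumed fusion)."""
    return np.stack([prev_grid, curr_grid], axis=0)  # shape: (2, X, Y, Z)

# Toy example: two points that shift by one voxel along x between frames.
grid_min = np.array([0.0, 0.0, 0.0])
voxel_size = 1.0
grid_shape = (4, 4, 2)
prev_pts = np.array([[0.5, 0.5, 0.5], [1.5, 0.5, 0.5]])
curr_pts = prev_pts + np.array([1.0, 0.0, 0.0])

fused = part_to_part_fuse(
    voxelize_occupancy(prev_pts, grid_min, voxel_size, grid_shape),
    voxelize_occupancy(curr_pts, grid_min, voxel_size, grid_shape),
)
# Per-voxel change between frames: -1 where the target left, +1 where it arrived.
change = fused[1] - fused[0]
```

In this toy setting the per-voxel change already hints at the motion direction. In a learned model, the fused tensor would presumably be passed to the motion modeling layers and prediction head rather than differenced by hand.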