
PillarTrack: Boosting Pillar Representation for Transformer-based 3D Single Object Tracking on Point Clouds

Weisheng Xu, Sifan Zhou, Jiaqi Xiong, Ziyu Zhao, Zhihang Yuan

TL;DR

PillarTrack addresses the information loss in point-based LiDAR 3D SOT by introducing a pillar-based representation that preserves geometry while enabling real-time processing. It pairs a Pyramid-Encoded Pillar Feature Encoder (PE-PFE) with a modality-aware Transformer backbone to enhance pillar features and efficiently capture geometric cues, reorienting computation toward early stages to leverage intrinsic point-cloud structure. The approach yields strong gains on KITTI and competitive performance on nuScenes, outperforming the baseline SMAT and many motion- or similarity-based trackers while delivering higher FPS. The work provides an open-source implementation and emphasizes practical deployment on resource-constrained platforms through reduced GFLOPs and potential quantization. Overall, PillarTrack offers a robust, efficient path for 3D SOT on point clouds with actionable design principles for backbone and feature encoding in pillar-based pipelines.

Abstract

LiDAR-based 3D single object tracking (3D SOT) is a critical task in robotics and autonomous driving. Existing 3D SOT methods typically adhere to a point-based processing pipeline, in which the re-sampling operation invariably leads to either redundant or missing information, thereby degrading performance. To address these issues, we propose PillarTrack, a novel pillar-based 3D SOT framework. First, we transform sparse point clouds into dense pillars to preserve local and global geometry. Second, we propose a Pyramid-Encoded Pillar Feature Encoder (PE-PFE) design to enhance the robustness of pillar features to translation, rotation, and scale. Third, we present an efficient Transformer-based backbone designed from the perspective of modality differences. Finally, we construct PillarTrack based on the above designs. Extensive experiments show that our method achieves comparable performance on the KITTI and nuScenes datasets, significantly enhancing the performance of the baseline.
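
To make the first step of the abstract concrete, the following is a minimal sketch of how points can be grouped into pillars on a bird's-eye-view grid. It is not the authors' implementation: the function name `pillarize`, the default grid range, and the pillar size are hypothetical values chosen for illustration, and the per-pillar mean is only a stand-in for a learned pillar feature encoder such as PE-PFE.

```python
import numpy as np


def pillarize(points, xy_range=(-3.2, -3.2, 3.2, 3.2), pillar_size=0.1):
    """Group an (N, 3) point cloud into pillars on an x-y grid.

    xy_range and pillar_size are illustrative assumptions, not values
    taken from the paper.
    Returns (coords, means, counts) for the non-empty pillars:
      coords: (P, 2) integer (row, col) grid index of each pillar
      means:  (P, 3) mean xyz of the points that fall in each pillar
      counts: (P,)   number of points in each pillar
    """
    points = np.asarray(points, dtype=np.float64)
    x_min, y_min, x_max, y_max = xy_range

    # Discard points outside the grid.
    inside = (
        (points[:, 0] >= x_min) & (points[:, 0] < x_max)
        & (points[:, 1] >= y_min) & (points[:, 1] < y_max)
    )
    pts = points[inside, :3]

    n_cols = int(round((x_max - x_min) / pillar_size))
    n_rows = int(round((y_max - y_min) / pillar_size))

    # Map x to a column index and y to a row index; clip guards against
    # floating-point round-off at the upper boundary.
    cols = np.clip(np.floor((pts[:, 0] - x_min) / pillar_size), 0, n_cols - 1)
    rows = np.clip(np.floor((pts[:, 1] - y_min) / pillar_size), 0, n_rows - 1)
    flat = (rows * n_cols + cols).astype(np.int64)

    # One entry per non-empty pillar, plus a point-to-pillar mapping.
    uniq, inverse, counts = np.unique(
        flat, return_inverse=True, return_counts=True
    )
    inverse = inverse.reshape(-1)

    # Accumulate xyz per pillar, then average.
    sums = np.zeros((len(uniq), 3))
    np.add.at(sums, inverse, pts)
    means = sums / counts[:, None]

    coords = np.stack([uniq // n_cols, uniq % n_cols], axis=1)
    return coords, means, counts
```

Only non-empty pillars are returned, so the output stays compact even when the grid is large; scattering `means` back into a dense `(n_rows, n_cols, 3)` array using `coords` would give the dense pseudo-image that a 2D backbone typically consumes.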

Paper Structure

This paper contains 12 sections, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: Comparison with other 3D SOT methods on the KITTI dataset. We classify methods based on the backbone architecture and report performance on Success and Precision.
  • Figure 2: The architecture of our PillarTrack network. Given the template and search area, we first use PE-PFE to extract multi-scale features from each. Then, the backbone extracts features at different scales and fuses the multi-scale similarity features. Finally, we apply the detection head on the fused feature map to localize the target.
  • Figure 3: Illustration of the PE-PFE design. We encode the point coordinates of the input point cloud in a pyramid-like manner. This pyramid-like encoding allows the network to optimize effectively without losing input information.
  • Figure 4: Illustration of the properties of PE-PFE. Red indicates the feature distribution after the original PFE encoding, while blue represents the feature distribution after PE-PFE encoding. (a) Original point cloud. (b) Point cloud with 1.2× scale. (c) Point cloud with 1.2 m translation. (d) Point cloud with 45$^\circ$ rotation.
  • Figure 5: Visualization of the output features of the fourth stage in t-SNE. (a) Visualization with the image blocks setting. (b) Visualization matching the total number of the original blocks. (c) Visualization after removing redundancies from the number of blocks.
  • ...and 1 more figure
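
The three perturbations listed for Figure 4 (1.2× scale, 1.2 m translation, 45° rotation) can be reproduced on any point cloud with a few lines of NumPy. The sketch below is illustrative only: the helper names are hypothetical, and it assumes scaling about the origin, translation along the x axis, and rotation about the vertical (z) axis, none of which is specified in the caption.

```python
import numpy as np


def scale_points(points, factor):
    """Scale an (N, 3) point cloud about the origin (assumed pivot)."""
    return np.asarray(points, dtype=np.float64) * factor


def translate_points(points, offset):
    """Shift an (N, 3) point cloud by a 3-vector offset in meters."""
    return np.asarray(points, dtype=np.float64) + np.asarray(offset, dtype=np.float64)


def rotate_z(points, degrees):
    """Rotate an (N, 3) point cloud counter-clockwise about the z axis."""
    theta = np.deg2rad(degrees)
    c, s = np.cos(theta), np.sin(theta)
    rotation = np.array([
        [c, -s, 0.0],
        [s, c, 0.0],
        [0.0, 0.0, 1.0],
    ])
    # Points are row vectors, so multiply by the transpose.
    return np.asarray(points, dtype=np.float64) @ rotation.T


# A synthetic cloud stands in for a real LiDAR object crop.
rng = np.random.default_rng(0)
cloud = rng.uniform(-1.0, 1.0, size=(128, 3))

variants = {
    "original": cloud,
    "scale_1.2x": scale_points(cloud, 1.2),
    "translate_1.2m": translate_points(cloud, [1.2, 0.0, 0.0]),  # x axis is an assumption
    "rotate_45deg": rotate_z(cloud, 45.0),
}
```

Feeding each entry of `variants` through a pillar feature encoder and comparing the resulting feature distributions would mirror the kind of comparison the caption describes.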