VoxelTrack: Exploring Voxel Representation for 3D Point Cloud Object Tracking

Yuxuan Lu, Jiahao Nie, Zhiwei He, Hongjie Gu, Xudong Lv

TL;DR

VoxelTrack tackles 3D single object tracking on LiDAR point clouds by voxelizing inputs and applying sparse 3D convolutions to preserve precise 3D spatial structure for direct regression. It introduces a dual-stream voxel encoder with cross-iterative feature fusion to capture fine-grained spatial cues, simplifying the tracking pipeline to a single regression loss. The approach achieves state-of-the-art results across KITTI, NuScenes, and Waymo Open Dataset and runs in real time at 36 FPS on a TITAN RTX, demonstrating robustness to sparsity and distractors. This voxel-centric framework offers a practical, high-precision solution for real-world autonomous driving tasks.

Abstract

Current LiDAR point cloud-based 3D single object tracking (SOT) methods typically rely on point-based representation networks. Despite their demonstrated success, such networks suffer from two fundamental problems: 1) They contain pooling operations to cope with inherently disordered point clouds, which hinders the capture of 3D spatial information that is useful for tracking, a regression task. 2) The adopted set abstraction operation struggles to handle density-inconsistent point clouds, which also prevents 3D spatial information from being modeled. To solve these problems, we introduce a novel tracking framework, termed VoxelTrack. By voxelizing inherently disordered point clouds into 3D voxels and extracting their features via sparse convolution blocks, VoxelTrack effectively models precise and robust 3D spatial information, thereby guiding accurate position prediction for tracked objects. Moreover, VoxelTrack incorporates a dual-stream encoder with a cross-iterative feature fusion module to further explore fine-grained 3D spatial information for tracking. Because accurate 3D spatial information is modeled, VoxelTrack simplifies the tracking pipeline with a single regression loss. Extensive experiments are conducted on three widely adopted datasets: KITTI, NuScenes and Waymo Open Dataset. The experimental results confirm that VoxelTrack achieves state-of-the-art performance (88.3%, 71.4% and 63.6% mean precision on the three datasets, respectively) and outperforms existing trackers while running in real time at 36 FPS on a single TITAN RTX GPU. The source code and model will be released.
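
The voxelization step mentioned in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function name `voxelize`, the crop range, the voxel sizes, and the choice of the mean point as the per-voxel feature are all assumptions made for illustration. The sketch only shows how an unordered point set becomes a set of occupied voxel coordinates with features, which is the form of input a sparse convolution backbone would consume.

```python
import numpy as np


def voxelize(points, voxel_size, range_min, range_max):
    """Group an (N, 3) point cloud into occupied voxels.

    Hypothetical helper for illustration only. Returns integer voxel
    coordinates of shape (M, 3) and the mean point of each voxel, shape (M, 3).
    """
    points = np.asarray(points, dtype=np.float64)
    range_min = np.asarray(range_min, dtype=np.float64)
    range_max = np.asarray(range_max, dtype=np.float64)

    # Keep only points inside the cropped region [range_min, range_max).
    inside = np.all((points >= range_min) & (points < range_max), axis=1)
    points = points[inside]

    # Integer voxel index of every remaining point along x, y, z.
    idx = np.floor((points - range_min) / voxel_size).astype(np.int64)

    # Unique occupied voxels; `inverse` maps each point to its voxel row.
    coords, inverse = np.unique(idx, axis=0, return_inverse=True)
    inverse = inverse.reshape(-1)

    # Assumed per-voxel feature: the mean of the points falling in the voxel.
    sums = np.zeros((len(coords), 3))
    np.add.at(sums, inverse, points)
    counts = np.bincount(inverse, minlength=len(coords))
    return coords, sums / counts[:, None]
```

Calling `voxelize` twice on the same points with a larger and a smaller `voxel_size` gives a coarse and a fine voxel set, which is the kind of paired input a dual-stream encoder would take. Because only occupied voxels are stored, the cost scales with the number of points rather than with the full grid volume.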

Paper Structure (17 sections, 7 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Comparison between point-based tracking methods (a) and our voxel-based tracking method (b). The point-based methods include the P2B series and the M$^2$Track series. The P2B series employs appearance matching to generate proposals and verifies one as the tracking result, while the M$^2$Track series models motion relations for tracking in a two-stage manner. In contrast, our VoxelTrack explores 3D spatial information through a voxel-based representation for tracking.
  • Figure 2: Overview of our proposed voxel-representation-based tracking framework, VoxelTrack. It consists of voxel division, multi-level voxel feature learning, and box regression components. "CIF" denotes the cross-iterative feature fusion module, where the last one performs single-direction fusion from the small-voxel (high-resolution) branch to the large-voxel (low-resolution) branch.
  • Figure 3: Illustration of large voxel and small voxel based inputs for dual-stream encoder. The inputs are denoted by $\textbf{V}_{t-1,t}^{large} \in \mathbb{R}^{W_l\times L_l\times H_l \times 6}$ and $\textbf{V}_{t-1,t}^{small} \in \mathbb{R}^{W_s\times L_s\times H_s \times 6}$, respectively.
  • Figure 4: Illustration of cross-iterative feature fusion. A pooling operation down-samples the large-scale 3D feature maps of the small voxel branch, which are then concatenated with the small-scale 3D feature maps of the large voxel branch. Correspondingly, a linear interpolation operation is employed to fuse features from the large voxel branch into the small voxel branch.
  • Figure 5: Performance comparison on three types of complex scene factors. (a) reflects robustness to sparse scenes on the Car category. (b) and (c) reflect robustness to various distractors on the Pedestrian category. [m, n] denotes the number of point cloud sequences and total frames.
  • ...and 1 more figures
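
The two-way exchange described in the Figure 4 caption can be sketched on dense arrays as follows. This is an illustrative approximation, not the paper's module. The sketch assumes average pooling, a fixed 2x resolution ratio between the branches, endpoint-aligned separable linear interpolation, and concatenation in both directions. The caption specifies only "pooling", "linear interpolation", and concatenation for the small-to-large direction. All function names are hypothetical, and the sketch uses dense NumPy tensors of shape (C, W, L, H) rather than sparse ones.

```python
import numpy as np


def downsample_avgpool(x, factor=2):
    """Average-pool a (C, W, L, H) map by `factor` along each spatial axis.

    Assumes every spatial size is divisible by `factor`.
    """
    c, w, l, h = x.shape
    x = x.reshape(c, w // factor, factor, l // factor, factor, h // factor, factor)
    return x.mean(axis=(2, 4, 6))


def upsample_linear(x, factor=2):
    """Separable linear interpolation along the three spatial axes."""
    for axis in (1, 2, 3):
        n = x.shape[axis]
        src = np.arange(n)
        # Endpoint-aligned sample positions (an assumption of this sketch).
        dst = np.linspace(0, n - 1, n * factor)
        x = np.apply_along_axis(lambda v: np.interp(dst, src, v), axis, x)
    return x


def cross_fuse(feat_small_voxel, feat_large_voxel):
    """Exchange features between the high- and low-resolution branches.

    feat_small_voxel: (C, 2W, 2L, 2H) map from the small-voxel branch.
    feat_large_voxel: (C, W, L, H) map from the large-voxel branch.
    Returns the fused maps for the small-voxel and large-voxel branches.
    """
    # Small-voxel branch -> large-voxel branch: pool, then concatenate channels.
    to_large = np.concatenate(
        [feat_large_voxel, downsample_avgpool(feat_small_voxel)], axis=0
    )
    # Large-voxel branch -> small-voxel branch: interpolate, then concatenate.
    to_small = np.concatenate(
        [feat_small_voxel, upsample_linear(feat_large_voxel)], axis=0
    )
    return to_small, to_large
```

Each fused map keeps its own branch's resolution and doubles its channel count. A real module would typically follow the concatenation with a convolution to mix the channels. That step is omitted here.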