Detection Is Tracking: Point Cloud Multi-Sweep Deep Learning Models Revisited

Lingji Chen

TL;DR

This work addresses the temporal aspect of multi-sweep lidar data by proposing MULSPAD, a detector that outputs a pair of bounding boxes for each object: one at the current time (the end of the input buffer) and one at the begin time of the buffer. Built on the VoxelNeXt framework, MULSPAD extends the voxel-based representation to six sweeps, aggregates time-staggered features in BEV, and adds an extra stage to enlarge the receptive field, yielding a unified heatmap and object-type regression. Begin/end times $t_b$ and $t_e$ are derived from ground-truth tracks within the six-sweep window, with birth/death events handled via artificial markers so that objects appearing or disappearing inside the window (for example, through occlusion) are accommodated. The tracking component uses a baseline RFS tracker with simple likelihood models that leverage paired detections to improve data association and robustness to the choice of motion model. Preliminary experiments on the Waymo Open Dataset show feasibility, achieving a MOTA/L2 of approximately $0.577$ for near-range vehicles, and point toward further ablations to quantify the gains from the paired-detections approach and the added temporal information.

Abstract

The conventional tracking paradigm takes in instantaneous measurements such as range and bearing, and produces object tracks across time. In applications such as autonomous driving, lidar measurements in the form of point clouds are usually passed through a "virtual sensor" realized by a deep learning model to produce "measurements" such as bounding boxes, which are in turn ingested by a tracking module to produce object tracks. Very often, multiple lidar sweeps are accumulated in a buffer and merged to form the input to the virtual sensor. We argue in this paper that such an input already contains temporal information, and therefore the virtual sensor output should also contain temporal information, not just instantaneous values for the time corresponding to the end of the buffer. In particular, we present a deep learning model called MULti-Sweep PAired Detector (MULSPAD) that produces, for each detected object, a pair of bounding boxes, one at the end time and one at the beginning time of the input buffer. This is achieved with fairly straightforward changes to commonly used lidar detection models, and with only marginal extra processing, but the resulting symmetry is satisfying. Such paired detections make it possible not only to construct rudimentary trackers fairly easily, but also to construct more sophisticated trackers that exploit the extra information conveyed by the pair and are robust to the choice of motion models and object birth/death models. We have conducted preliminary training and experimentation using the Waymo Open Dataset, which shows the efficacy of our proposed method.
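To illustrate why paired detections make a rudimentary tracker easy to build, here is a minimal sketch. It is not the tracker used in the paper. It assumes that detections are available at two frames whose buffers touch, so that the end box of an earlier detection and the begin box of a later detection refer to the same time instant and can be matched by distance alone, without a motion model. The class `PairedDetection`, the function `link_pairs`, the use of 2D box centers, and the `gate` threshold are all hypothetical names and choices made for illustration.

```python
from dataclasses import dataclass
from math import hypot
from typing import Dict, List, Tuple


@dataclass
class PairedDetection:
    """Hypothetical container: one detection with its pair of box centers."""
    det_id: int
    begin_xy: Tuple[float, float]  # box center at the begin time of the buffer
    end_xy: Tuple[float, float]    # box center at the end time of the buffer


def link_pairs(previous: List[PairedDetection],
               current: List[PairedDetection],
               gate: float = 1.0) -> Dict[int, int]:
    """Greedily link previous end boxes to current begin boxes.

    Assumes both boxes describe the same time instant, so a plain
    distance gate (an illustrative choice) is enough for association.
    """
    candidates = []
    for p in previous:
        for c in current:
            d = hypot(p.end_xy[0] - c.begin_xy[0], p.end_xy[1] - c.begin_xy[1])
            if d <= gate:
                candidates.append((d, p.det_id, c.det_id))

    links: Dict[int, int] = {}
    used_current = set()
    # Closest candidate pairs are linked first; each detection is used once.
    for _, prev_id, cur_id in sorted(candidates):
        if prev_id in links or cur_id in used_current:
            continue
        links[prev_id] = cur_id
        used_current.add(cur_id)
    return links
```

A current detection left unlinked would be a candidate for a new track, and a previous detection left unlinked a candidate for termination; a tracker such as the RFS tracker mentioned above would treat these cases probabilistically rather than with a hard gate.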

Paper Structure (13 sections, 15 equations, 9 figures)


Introduction
Related Work
Detection
Architecture
Begin and end times
Tracking
Random Finite Set (RFS) tracker, special case
Likelihood models
Static and slow moving objects
Moving objects
Some tracking results
Results and Ablation Studies
Conclusions

Figures (9)

Figure 1: With 6 lidar sweeps, detections are obtained in pairs: a green bounding box at the current sweep indexed by 0, and a red bounding box at the past sweep indexed by -5. Left: a moving car; middle: a parked car; right: two pedestrians walking in opposite directions.
Figure 2: Architecture of MULSPAD, motivated by VoxelNeXt (Chen et al., 2023).
Figure 3: In Waymo Open Dataset, ground truth track IDs such as 3uxz-RltuMwlmgDehBaZfA are stored in the field id inside Label.
Figure 4: Define the begin time $t_b$ and the end time $t_e$ of a ground truth track in the 6-sweep buffer. Most cases fall into (a) while some into variations of (b); the complication of "breakage" is ignored in this paper. To distinguish from the "singleton" case in (e), a "birth time target" is used for (c), and a "death time target" is used for (d).
Figure 5: Two pedestrian tracks with their constituent detections. The pair has the same color, and Detection ID is printed in the center of the solid bounding box.
...and 4 more figures
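The definition of $t_b$ and $t_e$ in the Figure 4 caption can be sketched as follows. This is a minimal illustration and not the paper's label-generation code. It assumes sweeps in the buffer are indexed from -5 (oldest) to 0 (current), as in Figure 1, and that for each ground-truth track ID we know the set of sweep indices in which it is labeled. The function names and the returned flags are hypothetical, and "breakage" (gaps inside a track) is ignored, as in the caption.

```python
from typing import Dict, Iterable, Optional, Tuple


def begin_end_times(present_sweeps: Iterable[int],
                    first: int = -5,
                    last: int = 0) -> Optional[Tuple[int, int]]:
    """Return (t_b, t_e): earliest and latest buffer sweep containing the track."""
    inside = sorted(s for s in present_sweeps if first <= s <= last)
    if not inside:
        return None  # the track does not appear in this buffer
    return inside[0], inside[-1]


def describe(t_b: int, t_e: int, first: int = -5, last: int = 0) -> Dict[str, bool]:
    """Illustrative flags separating full-window, birth, death and singleton cases."""
    return {
        "born_in_buffer": t_b > first,   # track starts after the oldest sweep
        "died_in_buffer": t_e < last,    # track ends before the current sweep
        "singleton": t_b == t_e,         # track is present in only one sweep
    }
```

For a track labeled in every sweep, `begin_end_times(range(-5, 1))` gives `(-5, 0)` and all three flags are false; how these flags map onto the specific cases (a) through (e) of Figure 4 is left open here.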
