JDT3D: Addressing the Gaps in LiDAR-Based Tracking-by-Attention

Brian Cheong; Jiachen Zhou; Steven Waslander

TL;DR

JDT3D advances LiDAR-based tracking-by-attention (TBA) by combining end-to-end joint detection and tracking (JDT) with two enhancements: track sampling augmentation, which enriches temporal supervision, and confidence-based query propagation, which aligns training with inference. On nuScenes, it achieves AMOTA $=0.574$ and AMOTP $=0.837$, outperforming existing LiDAR-based TBA methods by over $6\%$ in AMOTA and reducing ID switches. Further analysis attributes the remaining gap with tracking-by-detection (TBD) to weaker multi-frame detection and to temporal confusion. The work shows that end-to-end JDT with longer temporal contexts and a more capable decoder can substantially improve LiDAR TBA, and it identifies where improvements are most needed. Overall, JDT3D demonstrates that the TBD–TBA gap in LiDAR multi-object tracking (MOT) can be narrowed and offers concrete, generalizable strategies for future LiDAR-based trackers.

Abstract

Tracking-by-detection (TBD) methods achieve state-of-the-art performance on 3D tracking benchmarks for autonomous driving. On the other hand, tracking-by-attention (TBA) methods have the potential to outperform TBD methods, particularly for long occlusions and challenging detection settings. This work investigates why TBA methods continue to lag in performance behind TBD methods using a LiDAR-based joint detector and tracker called JDT3D. Based on this analysis, we propose two generalizable methods to bridge the gap between TBD and TBA methods: track sampling augmentation and confidence-based query propagation. JDT3D is trained and evaluated on the nuScenes dataset, achieving 0.574 on the AMOTA metric on the nuScenes test set, outperforming all existing LiDAR-based TBA approaches by over 6%. Based on our results, we further discuss some potential challenges with the existing TBA model formulation to explain the continued gap in performance with TBD methods. The implementation of JDT3D can be found at the following link: https://github.com/TRAILab/JDT3D.

Paper Structure (27 sections, 2 equations, 4 figures, 9 tables)

Introduction
Related Works
Tracking-by-detection (TBD)
Joint Detection and Tracking (JDT)
Tracking-by-Attention (TBA)
LiDAR Data Augmentations
Method
Overview
Query Propagation
Ground Truth Assignment
Track Sampling Augmentation
Network Training and Losses
Experiments
Dataset and Metrics
Implementation Details
...and 12 more sections

Figures (4)

Figure 1: Illustration of tracking by attention. Each object is represented by a query. In the diagram, the yellow motorcycle leaves the frame, so it is removed from the set of maintained queries. To detect new objects in the scene, such as the tan bus, proposal queries are appended at each time step, represented by the blank squares.
Figure 2: JDT3D Architecture. At each time step, a BEV feature map is extracted and used to initialize a set of proposal queries. The proposal queries are concatenated to track queries passed from the previous frame and used to predict objects in the scene. Track queries detect the same unique objects in each time step, while proposal queries detect untracked or new objects.
Figure 3: An example of track sampling augmentation over three consecutive frames from the nuScenes dataset. The original trajectories and sampled trajectories are shown in blue and orange, respectively. Only a subset of the tracks and the third LiDAR scan are shown for visual clarity. The older boxes are shown with transparency.
Figure 4: An example of temporal confusion, where the decoder must handle the same proposal query differently based on the presence of track queries. In Case 1, the proposal query should be a positive prediction, while in Case 2, it should be a negative prediction.
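
The query lifecycle described in Figures 1 and 2 can be illustrated with a minimal sketch. This is not the JDT3D implementation: the `Query` class, the `propagate_queries` function, the single `keep_threshold` value of 0.4, and the ID assignment scheme are all hypothetical, chosen only to show the general idea of carrying confident queries forward as track queries and appending fresh proposal queries at each time step.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Query:
    embedding: List[float]          # hypothetical feature vector for the query
    track_id: Optional[int] = None  # None marks a proposal query with no identity yet
    score: float = 0.0              # decoder confidence for the current frame


def propagate_queries(
    decoded: List[Query],
    proposals: List[Query],
    keep_threshold: float = 0.4,    # assumed value, not taken from the paper
    next_id: int = 0,
) -> Tuple[List[Query], int]:
    """Build the query set for the next frame from this frame's decoder output."""
    carried: List[Query] = []
    for q in decoded:
        if q.score < keep_threshold:
            # Low confidence: the object is treated as gone (or never detected),
            # so the query is not carried into the next frame.
            continue
        if q.track_id is None:
            # A confident proposal query becomes a new track with a fresh ID.
            q = Query(q.embedding, next_id, q.score)
            next_id += 1
        carried.append(q)
    # Fresh proposal queries are appended so that new objects can be detected.
    fresh = [Query(p.embedding, None, 0.0) for p in proposals]
    return carried + fresh, next_id


# Usage: one existing track survives, one is dropped, one proposal becomes a track.
decoded = [
    Query([0.1, 0.2], track_id=7, score=0.9),     # tracked object still visible
    Query([0.3, 0.1], track_id=8, score=0.1),     # e.g. an object that left the scene
    Query([0.5, 0.5], track_id=None, score=0.8),  # proposal that found a new object
]
proposals = [Query([0.0, 0.0]), Query([1.0, 1.0])]
queries, next_id = propagate_queries(decoded, proposals, keep_threshold=0.4, next_id=9)
print([q.track_id for q in queries], next_id)
```

In this sketch the same confidence rule decides which queries survive regardless of whether ground truth is available, which is one way to read the stated goal of aligning training and inference; the actual criterion used by JDT3D may differ.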
