STT: Stateful Tracking with Transformers for Autonomous Driving

Longlong Jing, Ruichi Yu, Xu Chen, Zhengli Zhao, Shiwei Sheng, Colin Graber, Qi Chen, Qinru Li, Shangxuan Wu, Han Deng, Sangjin Lee, Chris Sweeney, Qiurui He, Wei-Chih Hung, Tong He, Xingyi Zhou, Farshid Moussavi, Zijian Guo, Yin Zhou, Mingxing Tan, Weilong Yang, Congcong Li

TL;DR

STT addresses 3D multi-object tracking with accurate state estimation for autonomous driving by unifying data association and motion-state prediction in a single transformer-based framework. It combines a Detection Encoder, Temporal Fusion, a Track State Decoder, and a Track-Detection Interaction module, all trained jointly, and introduces the stateful metrics S-MOTA and MOTP_S to quantify state accuracy alongside association quality. On the Waymo Open Dataset, STT reports competitive tracking and state-estimation results, including on S-MOTA and MOTP_S, and runs in real time when paired with an efficient detector. Ablations examine the effect of history length and detector robustness, supporting the value of long-term temporal modeling and joint optimization for safety-critical perception.

Abstract

Tracking objects in three-dimensional space is critical for autonomous driving. To ensure safety while driving, the tracker must reliably track objects across frames and accurately estimate their current states, such as velocity and acceleration. Existing works frequently focus on the association task while either neglecting model performance on state estimation or deploying complex heuristics to predict the states. In this paper, we propose STT, a Stateful Tracking model built with Transformers, that consistently tracks objects in a scene while also predicting their states accurately. STT consumes rich appearance, geometry, and motion signals through a long-term history of detections and is jointly optimized for both data association and state estimation. Since standard tracking metrics like MOTA and MOTP do not capture the combined performance of the two tasks across the wider spectrum of object states, we extend them with new metrics called S-MOTA and MOTP_S that address this limitation. STT achieves competitive real-time performance on the Waymo Open Dataset.

Paper Structure

This paper contains 18 sections, 13 equations, 2 figures, and 3 tables.

Figures (2)

  • Figure 1: Illustration of the S-MOTA metric. MOTA (MOT16) considers only IoU when matching labels to predictions, so it does not reveal state errors (e.g., the velocity error shown in the figure). S-MOTA addresses this limitation with an additional thresholding step that assesses the accuracy of the predicted state.
  • Figure 2: Overview of STT. We first use the Detection Encoder to encode all of the 3D detections and extract temporal features for each track. The temporal features are fed into the Track-Detection Interaction module to aggregate information from surrounding detections and produce association scores and predicted states for each track. The Track State Decoder also takes the temporal features to produce track states in the previous frame $t-1$. All modules are jointly optimized.
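
The extra thresholding step described for Figure 1 can be illustrated with a small sketch. This is not the paper's evaluation code: the `Match` record, the threshold values, and the rule that a match failing the state check is counted as both a false positive and a false negative are assumptions made for illustration. The score itself uses the standard CLEAR MOT form, 1 - (FP + FN + IDSW) / GT.

```python
from dataclasses import dataclass


@dataclass
class Match:
    """A label-prediction pair (hypothetical record, not from the paper)."""
    iou: float        # 3D IoU between the label box and the predicted box
    vel_error: float  # velocity error in m/s, used as the example state error


def count_errors(matches, unmatched_preds, unmatched_labels,
                 iou_thresh=0.5, vel_thresh=1.0, use_state=False):
    """Return (fp, fn). With use_state=True, a pair must also pass the
    state-error threshold to count as a true positive (assumed rule)."""
    fp, fn = unmatched_preds, unmatched_labels
    for m in matches:
        passes_iou = m.iou >= iou_thresh
        passes_state = (not use_state) or m.vel_error <= vel_thresh
        if not (passes_iou and passes_state):
            # Assumption: a rejected pair leaves one spurious prediction
            # and one missed label.
            fp += 1
            fn += 1
    return fp, fn


def mota(num_gt, fp, fn, id_switches=0):
    """Standard CLEAR MOT accuracy: 1 - (FP + FN + IDSW) / GT."""
    return 1.0 - (fp + fn + id_switches) / num_gt


# Toy frame: 4 labels, 3 matched pairs, 1 unmatched prediction, 1 missed label.
matches = [Match(0.8, 0.2), Match(0.7, 3.0), Match(0.6, 0.5)]
num_gt = 4

fp, fn = count_errors(matches, unmatched_preds=1, unmatched_labels=1)
mota_plain = mota(num_gt, fp, fn)

fp_s, fn_s = count_errors(matches, unmatched_preds=1, unmatched_labels=1,
                          use_state=True)
mota_stateful = mota(num_gt, fp_s, fn_s)

print(mota_plain, mota_stateful)
```

In this toy frame the second pair overlaps well (IoU 0.7) but has a 3.0 m/s velocity error, so it counts as correct under the IoU-only check and is rejected once the state threshold is applied, which lowers the stateful score.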