STT: Stateful Tracking with Transformers for Autonomous Driving
Longlong Jing, Ruichi Yu, Xu Chen, Zhengli Zhao, Shiwei Sheng, Colin Graber, Qi Chen, Qinru Li, Shangxuan Wu, Han Deng, Sangjin Lee, Chris Sweeney, Qiurui He, Wei-Chih Hung, Tong He, Xingyi Zhou, Farshid Moussavi, Zijian Guo, Yin Zhou, Mingxing Tan, Weilong Yang, Congcong Li
TL;DR
STT addresses robust 3D multi-object tracking with accurate state estimation for autonomous driving by unifying data association and motion-state prediction in a single transformer-based framework. The model consists of a Detection Encoder, Temporal Fusion, a Track State Decoder, and a Track-Detection Interaction module, all trained jointly. The paper also formalizes two stateful metrics, S-MOTA and MOTP_S, which quantify state accuracy alongside association quality. On the Waymo Open Dataset, STT achieves competitive tracking and state-estimation performance, with gains on S-MOTA and MOTP_S, and runs in real time when paired with an efficient detector. The work highlights the value of long-term temporal modeling and joint optimization for safety-critical perception tasks, and includes ablations on history length and detector robustness.
Abstract
Tracking objects in three-dimensional space is critical for autonomous driving. To ensure safety while driving, the tracker must reliably track objects across frames and accurately estimate their current states, such as velocity and acceleration. Existing works frequently focus on the association task while either neglecting model performance on state estimation or deploying complex heuristics to predict the states. In this paper, we propose STT, a Stateful Tracking model built with Transformers, that can consistently track objects in a scene while also predicting their states accurately. STT consumes rich appearance, geometry, and motion signals through a long-term history of detections and is jointly optimized for both data association and state estimation. Since standard tracking metrics such as MOTA and MOTP do not capture the combined performance of the two tasks across the wider spectrum of object states, we extend them with new metrics, S-MOTA and MOTP_S, that address this limitation. STT achieves competitive real-time performance on the Waymo Open Dataset.
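To make the metric discussion concrete, the following is a minimal sketch of how standard MOTA and MOTP are computed from per-frame matches, together with one possible way a state-aware variant could gate matches on state error. The `mota` and `motp` functions follow the usual CLEAR MOT definitions. The `Match` class, the `state_threshold` parameter, and the rule that a match failing the state check counts as both a miss and a false positive are assumptions made for illustration; the abstract does not spell out the exact definitions of S-MOTA and MOTP_S.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Match:
    """One matched (track, ground-truth) pair in a single frame (hypothetical type)."""
    position_error: float  # e.g. metres between predicted and true box centre
    state_error: float     # e.g. m/s between predicted and true velocity


def mota(num_gt: int, num_fn: int, num_fp: int, num_id_switches: int) -> float:
    """Standard MOTA: 1 - (FN + FP + IDSW) / GT."""
    if num_gt == 0:
        raise ValueError("MOTA is undefined without ground-truth objects")
    return 1.0 - (num_fn + num_fp + num_id_switches) / num_gt


def motp(matches: List[Match]) -> float:
    """Standard MOTP: mean position error over all matched pairs."""
    if not matches:
        raise ValueError("MOTP is undefined without matches")
    return sum(m.position_error for m in matches) / len(matches)


def stateful_mota(matches: List[Match], num_gt: int, num_fp: int,
                  num_id_switches: int, state_threshold: float) -> float:
    """Illustrative state-gated MOTA (assumed rule, not the paper's definition).

    A match only survives if its state error is within state_threshold.
    A demoted match is counted as both a false negative and a false positive.
    """
    kept = [m for m in matches if m.state_error <= state_threshold]
    demoted = len(matches) - len(kept)
    num_fn = num_gt - len(kept)
    return mota(num_gt, num_fn, num_fp + demoted, num_id_switches)


def stateful_motp(matches: List[Match], state_threshold: float) -> float:
    """Illustrative state-aware MOTP: mean state error over surviving matches (assumption)."""
    kept = [m for m in matches if m.state_error <= state_threshold]
    if not kept:
        raise ValueError("no matches satisfy the state threshold")
    return sum(m.state_error for m in kept) / len(kept)


if __name__ == "__main__":
    # Four ground-truth objects, two matched; one match has a poor velocity estimate.
    matches = [Match(position_error=0.2, state_error=0.1),
               Match(position_error=0.4, state_error=2.0)]
    print(mota(num_gt=4, num_fn=2, num_fp=1, num_id_switches=0))
    print(stateful_mota(matches, num_gt=4, num_fp=1, num_id_switches=0,
                        state_threshold=1.0))
```

In this toy case the second match is well localized but has a large velocity error, so the position-only metric credits it while the state-gated variant does not. This is the kind of gap between association quality and state accuracy that the stateful metrics are meant to expose.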
