SOTFormer: A Minimal Transformer for Unified Object Tracking and Trajectory Prediction
Zhongping Dong, Pengyang Yu, Shuangjian Li, Liming Chen, Mohand Tahar Kechadi
TL;DR
SOTFormer addresses single-object tracking and short-horizon trajectory forecasting under occlusion, scale variation, and temporal drift with a constant-memory temporal transformer. It combines ground-truth-primed (GT-primed) memory initialization, a single lightweight temporal-attention block, and a unified multi-task loss, so inference runs in real time with fixed GPU memory. The key contributions are a GT-primed memory mechanism with a burn-in anchor loss, a constant-memory temporal block that avoids token growth, and end-to-end optimization of detection, tracking, and forecasting. On Mini-LaSOT (20%), SOTFormer reaches 76.3 AUC at 53.7 FPS (AMP, 4.3 GB VRAM), outperforming transformer baselines such as TrackFormer and MOTRv2. The work offers a practical, deployable paradigm for scalable, reproducible transformer tracking that can extend to multi-object and multi-modal sequential tasks.
Abstract
Accurate single-object tracking and short-term motion forecasting remain challenging under occlusion, scale variation, and temporal drift, which disrupt the temporal coherence required for real-time perception. We introduce SOTFormer, a minimal constant-memory temporal transformer that unifies object detection, tracking, and short-horizon trajectory prediction within a single end-to-end framework. Unlike prior models with recurrent or stacked temporal encoders, SOTFormer achieves stable identity propagation through a ground-truth-primed memory and a burn-in anchor loss that explicitly stabilizes initialization. A single lightweight temporal-attention layer refines embeddings across frames, enabling real-time inference with fixed GPU memory. On the Mini-LaSOT (20%) benchmark, SOTFormer attains 76.3 AUC and 53.7 FPS (AMP, 4.3 GB VRAM), outperforming transformer baselines such as TrackFormer and MOTRv2 under fast motion, scale change, and occlusion.
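The constant-memory idea can be illustrated with a toy sketch. This is not the authors' implementation: the class name `ConstantMemoryAttention`, the `capacity` parameter, the residual update, and the single-head dot-product attention are all assumptions made for illustration. It shows only how a memory seeded from a ground-truth embedding and capped at a fixed size keeps the per-frame attention cost bounded as the sequence grows.

```python
import math
from collections import deque


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class ConstantMemoryAttention:
    """Hypothetical fixed-size memory with single-head dot-product attention."""

    def __init__(self, capacity, primed_embedding):
        # Assumption: "GT-primed" is modeled here as seeding the memory with
        # one embedding derived from the ground-truth first-frame target.
        self.memory = deque([list(primed_embedding)], maxlen=capacity)

    def step(self, query):
        """Refine the current frame embedding against the bounded memory."""
        dim = len(query)
        # Scaled dot-product scores between the query and each memory entry.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in self.memory
        ]
        weights = softmax(scores)
        # Attention output: weighted sum of memory entries.
        context = [
            sum(w * value[i] for w, value in zip(weights, self.memory))
            for i in range(dim)
        ]
        # Residual connection (an assumption of this sketch).
        refined = [q + c for q, c in zip(query, context)]
        # deque(maxlen=...) evicts the oldest entry, so memory never grows.
        self.memory.append(refined)
        return refined


block = ConstantMemoryAttention(capacity=4, primed_embedding=[1.0, 0.0, 0.0])
for t in range(10):
    out = block.step([0.1 * t, 1.0, -0.5])
print(len(block.memory))  # 4: the memory stays at its capacity
```

Because the memory holds at most `capacity` entries, each step attends over a bounded number of tokens regardless of sequence length, which is the property the abstract describes as avoiding token growth.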
