SOTFormer: A Minimal Transformer for Unified Object Tracking and Trajectory Prediction

Zhongping Dong, Pengyang Yu, Shuangjian Li, Liming Chen, Mohand Tahar Kechadi

TL;DR

SOTFormer unifies single-object tracking and short-horizon trajectory forecasting in a constant-memory temporal transformer designed to stay stable under occlusion, scale variation, and temporal drift. It combines Ground-Truth-Primed (GT-Primed) memory initialization, a single lightweight temporal-attention block, and a unified multi-task loss to run in real time with fixed memory. Key contributions are a GT-Primed memory mechanism with a burn-in anchor loss, a constant-memory temporal block that avoids token growth, and end-to-end optimization of detection, tracking, and forecasting. On Mini-LaSOT (20%), the model reaches 76.3 AUC at 53.7 FPS (AMP, 4.3 GB VRAM). The authors present it as a practical, deployable approach to transformer tracking that could extend to multi-object and multi-modal sequential tasks.

Abstract

Accurate single-object tracking and short-term motion forecasting remain challenging under occlusion, scale variation, and temporal drift, which disrupt the temporal coherence required for real-time perception. We introduce SOTFormer, a minimal constant-memory temporal transformer that unifies object detection, tracking, and short-horizon trajectory prediction within a single end-to-end framework. Unlike prior models with recurrent or stacked temporal encoders, SOTFormer achieves stable identity propagation through a ground-truth-primed memory and a burn-in anchor loss that explicitly stabilizes initialization. A single lightweight temporal-attention layer refines embeddings across frames, enabling real-time inference with fixed GPU memory. On the Mini-LaSOT (20%) benchmark, SOTFormer attains 76.3 AUC and 53.7 FPS (AMP, 4.3 GB VRAM), outperforming transformer baselines such as TrackFormer and MOTRv2 under fast motion, scale change, and occlusion.
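
To make the constant-memory idea concrete, the following is a minimal sketch rather than the authors' implementation. It assumes a single attention head with no learned projections, a residual connection, and a memory that is simply overwritten with a copy of the refined queries after each frame; the function names `attend` and `run_sequence` are hypothetical. The point it illustrates is only that the memory keeps the same number of slots however many frames are processed.

```python
import math

def attend(queries, memory):
    """Scaled dot-product attention of each query over a fixed-size memory,
    followed by a residual add. No learned weights (illustrative assumption)."""
    d = len(queries[0])
    refined = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, m)) / math.sqrt(d) for m in memory]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        context = [sum(w * m[i] for w, m in zip(weights, memory)) / total
                   for i in range(d)]
        refined.append([qi + ci for qi, ci in zip(q, context)])
    return refined

def run_sequence(frames, init_memory):
    """Refine the queries of each frame against the previous memory.
    Assumption: the memory is replaced by a copy of the refined queries,
    standing in for a detached latent memory of constant size."""
    memory = init_memory
    outputs, memory_sizes = [], []
    for queries in frames:
        refined = attend(queries, memory)
        outputs.append(refined)
        memory = [row[:] for row in refined]  # copy: no link back to earlier frames
        memory_sizes.append(len(memory))
    return outputs, memory_sizes

# Three frames, two queries per frame, embedding size 2 (toy values).
frames = [[[1.0, 0.0], [0.0, 1.0]] for _ in range(3)]
outputs, memory_sizes = run_sequence(frames, init_memory=[[0.5, 0.5], [0.1, 0.2]])
print(memory_sizes)
```

In an autograd framework, the copy step would correspond to detaching the memory tensor so that gradients do not flow through earlier frames; that mapping is an interpretation of the description above, not a detail confirmed by the text.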

Paper Structure

This paper contains 33 sections, 5 equations, 3 figures, 4 tables.

Figures (3)

  • Figure 1: SOTFormer Overview: A constant-memory temporal transformer that integrates detection, tracking, and short-horizon trajectory prediction. Early frames are Ground-Truth-Primed by IoU-based query swapping, a lightweight temporal-attention block refines cross-frame embeddings, and a trajectory head predicts cumulative motion for unified spatio-temporal reasoning.
  • Figure 2: Architecture of SOTFormer. Each input frame $I_t$ is processed by a Deformable-DETR backbone to extract multi-scale features and produce query embeddings $Q_t$. During training, a Ground-Truth-Primed (GT-Primed) slot-0 swap anchors the target identity in early frames. The Constant-Memory Temporal Block refines current queries $Q_t$ using the detached latent memory $M_{t-1}$, producing updated embeddings $\tilde{Q}_t$ while keeping memory and gradient depth constant. Three parallel feed-forward network (FFN) heads then decode $\tilde{Q}_t$ into task-specific outputs: bounding boxes $b_t$, class logits $c_t$, and trajectory offsets $\Delta p_{1:H}$. These outputs are supervised by corresponding losses: Spatial Loss ($L_1 + \mathrm{GIoU}$) for localization, Classification Loss (CE) for recognition, and Motion Loss (ADE + FDE) for trajectory forecasting. Together, these components form a unified framework for detection, tracking, and short-horizon prediction with constant memory cost.
  • Figure 3: Qualitative visualization under challenging scenarios. Rows (a)–(h) show representative Mini-LaSOT sequences: (a) Fast Motion (FM), (b) Occlusion (OCC), (c) Scale Change (SC), (d) Illumination Change (IC), (e) Nighttime (NT), (f) Background Clutter (BC), (g) Deformation (DF), (h) Underwater Environment (UE). Green: prediction, Red: ground truth, Yellow: predicted trajectory.
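
The Motion Loss named in the Figure 2 caption combines ADE and FDE over the predicted trajectory offsets $\Delta p_{1:H}$. The sketch below shows one plausible reading, under stated assumptions: the offsets are treated as per-step 2D displacements that are summed cumulatively from the current box center, ADE is the mean Euclidean error over the horizon, and FDE is the error at the final step. The helper names `rollout` and `ade_fde` are hypothetical, and the summary above does not say how the two terms are weighted. If the trajectory head emits cumulative displacements directly, the summation step would be skipped.

```python
import math
from itertools import accumulate

def rollout(start, offsets):
    """Convert per-step (dx, dy) offsets into absolute future positions
    by cumulative summation from the starting center (assumed convention)."""
    xs = accumulate((dx for dx, _ in offsets), initial=start[0])
    ys = accumulate((dy for _, dy in offsets), initial=start[1])
    return list(zip(xs, ys))[1:]  # drop the starting point itself

def ade_fde(pred, target):
    """Average displacement error (mean over the horizon) and
    final displacement error (last step only)."""
    dists = [math.dist(p, g) for p, g in zip(pred, target)]
    return sum(dists) / len(dists), dists[-1]

# Horizon H = 2 with toy values.
pred = rollout((0.0, 0.0), [(3.0, 4.0), (3.0, 4.0)])
ade, fde = ade_fde(pred, [(3.0, 4.0), (6.0, 4.0)])
print(pred, ade, fde)
```

Here the first predicted step matches the target exactly and the second is off by 4 units vertically, so the average error is half the final error.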