Is a Pure Transformer Effective for Separated and Online Multi-Object Tracking?
Chongwei Liu, Haojie Li, Zhihui Wang, Rui Xu
TL;DR
This work reframes trajectory graphs in multi-object tracking (MOT) as directed acyclic graphs and uses that view to validate a Pure Transformer (PuTR) in the online, separated tracking-by-detection (TbD) setting. PuTR takes a frame-ordered object sequence, applies a frame-aware attention mask with temporal and spatial encodings, and unifies short- and long-term association within a decoder-only Transformer, avoiding a fixed object-ID dictionary. Across MOT17, MOT20, DanceTrack, and SportsMOT, PuTR establishes a competitive baseline, shows strong domain adaptation (a minimal cross-dataset gap), and runs in real time while preserving online processing. The results suggest that pure Transformer architectures offer a viable, efficient, and adaptable direction for MOT association, with potential extensions to motion cues and broader tracking domains.
Abstract
Recent advances in Multi-Object Tracking (MOT) have demonstrated significant success in short-term association within the separated tracking-by-detection online paradigm. However, long-term tracking remains challenging. While graph-based approaches address this by modeling trajectories as global graphs, these methods are unsuitable for real-time applications due to their non-online nature. In this paper, we review the concept of trajectory graphs and propose a novel perspective by representing them as directed acyclic graphs. This representation can be described using frame-ordered object sequences and binary adjacency matrices. We observe that this structure naturally aligns with Transformer attention mechanisms, enabling us to model the association problem using a classic Transformer architecture. Based on this insight, we introduce a concise Pure Transformer (PuTR) to validate the effectiveness of the Transformer in unifying short- and long-term tracking for separated online MOT. Extensive experiments on four diverse datasets (SportsMOT, DanceTrack, MOT17, and MOT20) demonstrate that PuTR establishes a solid baseline compared to existing foundational online methods while exhibiting superior domain adaptation capabilities. Furthermore, the separated nature enables efficient training and inference, making it suitable for practical applications. Implementation code and trained models are available at https://github.com/chongweiliu/PuTR.
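To make the correspondence between trajectory graphs and attention concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes a toy frame-ordered sequence of five detections, represents trajectories as a binary adjacency matrix linking each detection to its immediate predecessor, and builds a frame-aware mask that lets a detection attend only to detections from strictly earlier frames (plus itself, which is an assumption of this sketch). The function names `frame_aware_mask` and `trajectory_adjacency` are hypothetical.

```python
def frame_aware_mask(frame_ids):
    """Return mask[i][j] = True if object i may attend to object j.

    Assumed rule for this sketch: attention is allowed only toward objects
    from strictly earlier frames, plus the object itself.
    """
    n = len(frame_ids)
    return [[frame_ids[j] < frame_ids[i] or i == j for j in range(n)]
            for i in range(n)]


def trajectory_adjacency(track_ids):
    """Return adj[i][j] = 1 if j is the immediate predecessor of i.

    The input must follow the frame-ordered sequence, so every edge points
    from a later object to an earlier one and the graph is acyclic.
    """
    n = len(track_ids)
    adj = [[0] * n for _ in range(n)]
    last_seen = {}  # track id -> index of its most recent detection
    for i, track in enumerate(track_ids):
        if track in last_seen:
            adj[i][last_seen[track]] = 1
        last_seen[track] = i
    return adj


# Toy sequence: two objects in frames 0 and 1, one object in frame 2.
frame_ids = [0, 0, 1, 1, 2]
track_ids = [7, 9, 7, 9, 7]  # hypothetical ground-truth identities

mask = frame_aware_mask(frame_ids)
adj = trajectory_adjacency(track_ids)

for mask_row, adj_row in zip(mask, adj):
    print([int(m) for m in mask_row], adj_row)
```

In this toy case every edge of the adjacency matrix falls inside the region the mask permits, which is the sense in which the graph structure lines up with masked attention: the mask defines which earlier objects are candidate predecessors, and the adjacency matrix marks which candidate is the true one.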
