Is a Pure Transformer Effective for Separated and Online Multi-Object Tracking?

Chongwei Liu, Haojie Li, Zhihui Wang, Rui Xu

TL;DR

This work reframes trajectory graphs in multi-object tracking as directed acyclic graphs and validates a Pure Transformer (PuTR) that operates in the online, separated tracking-by-detection (TbD) setting. PuTR feeds a frame-ordered object sequence, together with a frame-aware attention mask and temporal/spatial encodings, to a decoder-only Transformer that unifies short- and long-term association without a fixed object-ID dictionary. Across MOT17, MOT20, DanceTrack, and SportsMOT, PuTR establishes a competitive baseline, adapts well across domains (minimal cross-dataset gap), and runs in real time while remaining online. The results suggest that pure Transformer architectures are a viable, efficient, and adaptable direction for MOT association, with potential extensions to motion cues and broader tracking domains.

Abstract

Recent advances in Multi-Object Tracking (MOT) have demonstrated significant success in short-term association within the separated tracking-by-detection online paradigm. However, long-term tracking remains challenging. While graph-based approaches address this by modeling trajectories as global graphs, these methods are unsuitable for real-time applications due to their non-online nature. In this paper, we review the concept of trajectory graphs and propose a novel perspective by representing them as directed acyclic graphs. This representation can be described using frame-ordered object sequences and binary adjacency matrices. We observe that this structure naturally aligns with Transformer attention mechanisms, enabling us to model the association problem using a classic Transformer architecture. Based on this insight, we introduce a concise Pure Transformer (PuTR) to validate the effectiveness of Transformer in unifying short- and long-term tracking for separated online MOT. Extensive experiments on four diverse datasets (SportsMOT, DanceTrack, MOT17, and MOT20) demonstrate that PuTR effectively establishes a solid baseline compared to existing foundational online methods while exhibiting superior domain adaptation capabilities. Furthermore, the separated nature enables efficient training and inference, making it suitable for practical applications. Implementation code and trained models are available at https://github.com/chongweiliu/PuTR .
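
The correspondence between a frame-ordered object sequence, a binary adjacency matrix, and an attention mask can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `frame_aware_mask` is hypothetical, and the masking rule (each object may attend to itself and to objects in strictly earlier frames, but not to other objects in its own frame) is an assumption made for illustration.

```python
def frame_aware_mask(frame_ids):
    """Build a binary attention mask for a frame-ordered object sequence.

    frame_ids[i] is the frame index of the i-th object in the sequence.
    mask[i][j] is True when query object i may attend to key object j.
    Assumed rule (illustrative only): an object sees itself and every
    object from a strictly earlier frame, so edges only point backward
    in time and the underlying graph stays acyclic.
    """
    n = len(frame_ids)
    return [
        [i == j or frame_ids[j] < frame_ids[i] for j in range(n)]
        for i in range(n)
    ]


# Two objects in frame 0, three in frame 1, one in frame 2.
frame_ids = [0, 0, 1, 1, 1, 2]
mask = frame_aware_mask(frame_ids)

for row in mask:
    print("".join("1" if allowed else "0" for allowed in row))
```

Unlike a plain causal (lower-triangular) mask, this mask blocks attention between different objects of the same frame, so the mask depends on frame membership rather than on sequence position alone. Boolean mask conventions differ between libraries (True may mean "attend" or "block"), so the matrix may need to be inverted before being passed to a particular attention implementation.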

Paper Structure (36 sections, 4 equations, 4 figures, 7 tables, 1 algorithm)

Figures (4)

  • Figure 1: Illustration of our chain of thought. Trajectories are inherently a directed acyclic graph in the temporal order (a). We can thus transform it equivalently into a binary adjacency matrix (b), exactly aligning with the Transformer's attention mask. Consequently, arranging objects by frame forms a natural input sequence for the Transformer (c), enabling it to model the association problem.
  • Figure 2: Distribution of object disappearance intervals in the SportsMOT and DanceTrack validation datasets. For each tracked identity, we compute the temporal gap (Interval, the x-axis) between adjacent detections by calculating the difference between current and previous frame numbers, while the y-axis shows their percentage among all non-consecutive cases (interval > 1).
  • Figure 3: Visual results of PuTR on the #dancetrack0011 video sequence of DanceTrack (top row) and the #v_1UDUODIBSsc_c004 video sequence of SportsMOT (bottom row), showcasing the model's capability to handle challenging scenarios such as extended occlusions (the blue arrow) and an individual moving out of and re-entering the camera view (the dark arrow).
  • Figure 4: A failure case in the #MOT17-14 video sequence of MOT17. The tiny pink bounding box switches individuals twice, in the second and fourth frames, because of weak appearance cues.
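
The interval statistic described in the Figure 2 caption can be sketched as follows. This is an illustrative reading of the caption rather than the authors' script: the function name `disappearance_interval_distribution` and the `(frame, track_id)` input format are assumptions, and the toy annotations are made up purely to exercise the code.

```python
from collections import Counter, defaultdict


def disappearance_interval_distribution(detections):
    """Return {interval: percentage} over non-consecutive detections.

    detections: iterable of (frame, track_id) pairs (assumed format).
    For each identity, the interval is the difference between the frame
    numbers of adjacent detections; only intervals > 1 are counted, and
    percentages are taken over all such non-consecutive cases.
    """
    frames_by_id = defaultdict(list)
    for frame, track_id in detections:
        frames_by_id[track_id].append(frame)

    counts = Counter()
    for frames in frames_by_id.values():
        ordered = sorted(set(frames))
        for prev, cur in zip(ordered, ordered[1:]):
            gap = cur - prev
            if gap > 1:  # keep only cases where the object disappeared
                counts[gap] += 1

    total = sum(counts.values())
    if total == 0:
        return {}
    return {gap: 100.0 * n / total for gap, n in sorted(counts.items())}


# Toy annotations: identity "a" is seen in frames 1, 2, 5; "b" in 1, 3, 6.
toy = [(1, "a"), (2, "a"), (5, "a"), (1, "b"), (3, "b"), (6, "b")]
dist = disappearance_interval_distribution(toy)
print(dist)
```

In the toy data, identity "a" contributes one gap of 3 and identity "b" contributes gaps of 2 and 3, so intervals of 3 make up two thirds of the non-consecutive cases.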