TransMOT: Spatial-Temporal Graph Transformer for Multiple Object Tracking

Peng Chu, Jiang Wang, Quanzeng You, Haibin Ling, Zicheng Liu

TL;DR

TransMOT introduces a spatial-temporal graph Transformer for online multi-object tracking that represents tracklets and detections as sparse weighted graphs. It encodes spatial relations with a spatial graph transformer encoder, fuses temporal context with a temporal transformer encoder, and decodes matchings with a spatial graph transformer decoder, all optimized under a relaxed assignment loss. A cascade association framework further improves speed and accuracy by handling low-score detections and long-term occlusions outside the transformer, where they would be expensive to model. Results on MOT15, MOT16, MOT17, and MOT20 show state-of-the-art IDF1 and MOTA, and the sparse graph representation makes the model cheaper to compute than a traditional Transformer.

Abstract

Tracking multiple objects in videos relies on modeling the spatial-temporal interactions of the objects. In this paper, we propose a solution named TransMOT, which leverages powerful graph transformers to efficiently model the spatial and temporal interactions among the objects. TransMOT effectively models the interactions of a large number of objects by arranging the trajectories of the tracked objects as a set of sparse weighted graphs, and constructing a spatial graph transformer encoder layer, a temporal transformer encoder layer, and a spatial graph transformer decoder layer based on the graphs. TransMOT is not only more computationally efficient than the traditional Transformer, but it also achieves better tracking accuracy. To further improve the tracking speed and accuracy, we propose a cascade association framework to handle low-score detections and long-term occlusions that require large computational resources to model in TransMOT. The proposed method is evaluated on multiple benchmark datasets including MOT15, MOT16, MOT17, and MOT20, and it achieves state-of-the-art performance on all the datasets.
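The idea of arranging objects as a sparse weighted graph can be illustrated with a minimal sketch. It assumes that edge weights are the IoU between bounding boxes in one frame and that non-overlapping pairs get no edge; both choices, along with the names `iou` and `build_sparse_graph`, are illustrative assumptions rather than details confirmed by the abstract.

```python
from typing import Dict, List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), assumed layout


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def build_sparse_graph(boxes: List[Box]) -> Dict[Tuple[int, int], float]:
    """Return a symmetric edge map {(i, j): weight} for one frame.

    Assumption: only overlapping pairs are connected, so the number of
    stored edges stays far below N^2 when most objects are far apart.
    """
    edges: Dict[Tuple[int, int], float] = {}
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            w = iou(boxes[i], boxes[j])
            if w > 0.0:
                edges[(i, j)] = w
                edges[(j, i)] = w
    return edges
```

In a graph attention layer, such an edge map would restrict each object to attending only to its connected neighbors, which is where the efficiency gain over dense attention across all object pairs would come from.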

Paper Structure

This paper contains 15 sections, 7 equations, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Overview of the proposed TransMOT pipeline for online MOT. The trajectory graph series $\mathbf{\Xi}^{t-1}$ up to frame $t-1$ and the detection candidate graph $\Theta^t$ at frame $t$ serve as the source and target inputs, respectively, to the spatial-temporal graph transformer.
  • Figure 2: The spatial graph transformer encoder layer.
  • Figure 3: Illustration of the spatial graph transformer decoder.
  • Figure 4: Illustration of the tracking system based on the cascade association framework.
  • Figure 5: Visualization of results on selected sequences in MOT15, MOT16, MOT17, and MOT20.