MOTR: End-to-End Multiple-Object Tracking with Transformer

Fangao Zeng, Bin Dong, Yuang Zhang, Tiancai Wang, Xiangyu Zhang, Yichen Wei

TL;DR

MOTR presents an end-to-end approach to multi-object tracking by extending DETR with track queries that iteratively predict object trajectories over time. It introduces tracklet-aware label assignment, an entrance/exit mechanism for newborn and terminated objects, and temporal modeling enhancements via a temporal aggregation network and a collective average loss. The method demonstrates strong temporal modeling capabilities, achieving state-of-the-art association performance on DanceTrack and competitive results on MOT17 compared with Transformer-based trackers, while remaining fully online and post-processing free. Overall, MOTR provides a strong, end-to-end baseline for Transformer-based MOT and emphasizes learning temporal dynamics without hand-crafted post-processing steps.

Abstract

Temporal modeling of objects is a key challenge in multiple-object tracking (MOT). Existing methods track by associating detections through motion-based and appearance-based similarity heuristics. The post-processing nature of association prevents end-to-end exploitation of temporal variations in video sequences. In this paper, we propose MOTR, which extends DETR and introduces track queries to model the tracked instances over the entire video. Each track query is transferred and updated frame by frame to perform iterative prediction over time. We propose tracklet-aware label assignment to train track queries and newborn object queries. We further propose a temporal aggregation network and a collective average loss to enhance temporal relation modeling. Experimental results on DanceTrack show that MOTR significantly outperforms the state-of-the-art method ByteTrack by 6.5% on the HOTA metric. On MOT17, MOTR outperforms our concurrent works, TrackFormer and TransTrack, on association performance. MOTR can serve as a stronger baseline for future research on temporal modeling and Transformer-based trackers. Code is available at https://github.com/megvii-research/MOTR.

Paper Structure

This paper contains 19 sections, 4 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: (a) DETR achieves end-to-end detection by letting object queries interact with image features and performing one-to-one assignment between the updated queries and objects. (b) MOTR performs set-of-sequences prediction by updating the track queries. Each track query represents a track. Best viewed in color.
  • Figure 2: Update process of detect (object) queries and track queries under some typical MOT cases. The track query set is updated dynamically, so its length is variable. It is initialized to be empty, and the detect queries are used to detect newborn objects. The hidden states of all detected objects are concatenated to produce the track queries for the next frame. Track queries assigned to terminated objects are removed from the track query set.
  • Figure 3: The overall architecture of MOTR. "Enc" represents a convolutional neural network backbone and the Transformer encoder that extracts image features for each frame. The concatenation of detect queries $q_{d}$ and track queries $q_{tr}$ is fed into the Deformable DETR decoder (Dec) to produce the hidden states. The hidden states are used to generate the prediction $\widehat{Y}$ of newborn and tracked objects. The query interaction module (QIM) takes the hidden states as input and produces track queries for the next frame.
  • Figure 4: The structure of the query interaction module (QIM). The inputs of QIM are the hidden states produced by the Transformer decoder and the corresponding prediction scores. In the inference stage, newborn objects are kept and exited objects are dropped based on their confidence scores. The temporal aggregation network (TAN) enhances long-range temporal modeling.
  • Figure 5: The effect of CAL on solving (a) duplicated boxes and (b) ID switch problems. Top and bottom rows are the tracking results without and with CAL, respectively.
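
The track-query update described in the captions of Figures 2 and 4 can be illustrated with a minimal sketch. This is not the authors' implementation: the `Query` class, the function `update_track_queries`, and the threshold values `entrance_thresh` and `exit_thresh` are hypothetical names and numbers chosen for illustration. The sketch also omits the temporal aggregation network, and it drops a track as soon as its score falls below the exit threshold, which may be simpler than the rule used in the paper.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Query:
    hidden: List[float]  # toy stand-in for a decoder hidden state
    score: float         # predicted confidence for this query
    track_id: int = -1   # -1 marks a detect query without an identity


def update_track_queries(
    detect_out: List[Query],
    track_out: List[Query],
    next_id: int,
    entrance_thresh: float = 0.7,  # hypothetical value, not taken from the source
    exit_thresh: float = 0.5,      # hypothetical value, not taken from the source
) -> Tuple[List[Query], int]:
    """Build the track query set for the next frame.

    Track queries whose score fell below exit_thresh are treated as
    terminated objects and removed. Detect queries whose score exceeds
    entrance_thresh are treated as newborn objects and get a fresh id.
    """
    # Keep the tracks that are still confidently visible.
    kept = [q for q in track_out if q.score >= exit_thresh]
    # Promote confident detect queries to new tracks.
    for q in detect_out:
        if q.score > entrance_thresh:
            kept.append(Query(q.hidden, q.score, next_id))
            next_id += 1
    return kept, next_id


# Frame 1: the track query set starts empty; one detection is confident.
frame1_detect = [Query([0.1, 0.2], 0.9), Query([0.3, 0.4], 0.3)]
tracks, next_id = update_track_queries(frame1_detect, [], next_id=0)

# Frame 2: the existing track loses confidence and a new object appears.
frame2_track = [Query(tracks[0].hidden, 0.2, tracks[0].track_id)]
frame2_detect = [Query([0.5, 0.6], 0.95)]
tracks2, next_id2 = update_track_queries(frame2_detect, frame2_track, next_id)
```

In this toy run the track set grows from empty to one track after the first frame; in the second frame that track is removed and a newborn object takes the next unused identity, so the set length varies over time as the Figure 2 caption describes.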