Multiple Object Tracking as ID Prediction

Ruopeng Gao, Ji Qi, Limin Wang

TL;DR

This work reframes multi-object tracking as an in-context ID prediction task and introduces MOTIP, a lightweight, end-to-end framework that decouples detection from association. By attaching learnable ID embeddings from a K+1 identity dictionary to tracklets and predicting IDs via a transformer-based ID Decoder, MOTIP achieves state-of-the-art results on DanceTrack, SportsMOT, and BFT without relying on handcrafted heuristics or extra data. The approach emphasizes generalization to unseen identities, robustness through trajectory augmentation, and simple inference without complex assignment rules. Overall, MOTIP demonstrates the potential of in-context identity prompts to drive robust MOT in challenging real-world scenarios.

Abstract

Multiple Object Tracking (MOT) has been a long-standing challenge in video understanding. A natural and intuitive approach is to split this task into two parts: object detection and association. Most mainstream methods employ meticulously crafted heuristic techniques to maintain trajectory information and compute cost matrices for object matching. Although these methods can achieve notable tracking performance, they often require a series of elaborate handcrafted modifications when facing complicated scenarios. We believe that manually assumed priors limit a method's adaptability and flexibility in learning optimal tracking capabilities from domain-specific data. Therefore, we introduce a new perspective that treats MOT as an in-context ID prediction task, transforming the aforementioned object association into an end-to-end trainable task. Based on this, we propose a simple yet effective method termed MOTIP. Given a set of trajectories carrying ID information, MOTIP directly decodes the ID labels for current detections to accomplish the association process. Without using tailored or sophisticated architectures, our method achieves state-of-the-art results across multiple benchmarks by solely leveraging object-level features as tracking cues. The simplicity and impressive results of MOTIP leave substantial room for future advancements, thereby making it a promising baseline for subsequent research. Our code and checkpoints are released at https://github.com/MCG-NJU/MOTIP.
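
To make the idea of decoding ID labels from trajectories concrete, the following is a minimal sketch and not the released MOTIP code. It assumes that each historical observation is stored as a (feature, ID label) pair, and it replaces the trained transformer ID Decoder with a single fixed dot-product attention step. The names predict_id, NEWBORN, newborn_score, and temperature are hypothetical; NEWBORN stands in for the extra entry of the K+1 dictionary, used here for detections that match no trajectory.

```python
import math

# Hypothetical stand-in for the extra (K+1)-th dictionary entry.
NEWBORN = "new"


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def predict_id(detection, trajectory_tokens, newborn_score=0.5, temperature=0.1):
    """Predict an ID label for one detection from in-context trajectory tokens.

    trajectory_tokens is a list of (feature, id_label) pairs. The detection
    attends to every historical feature; attention mass is then summed per
    ID label. newborn_score is an assumed fixed similarity for the newborn
    entry, so detections that resemble nothing in the history fall back to it.
    """
    scores = [dot(detection, feat) / temperature for feat, _ in trajectory_tokens]
    scores.append(newborn_score / temperature)
    weights = softmax(scores)

    probs = {}
    for (_, label), w in zip(trajectory_tokens, weights[:-1]):
        probs[label] = probs.get(label, 0.0) + w
    probs[NEWBORN] = weights[-1]

    best = max(probs, key=probs.get)
    return best, probs


# Two trajectories carrying the arbitrary labels 3 and 7.
history = [([1.0, 0.0], 3), ([0.9, 0.1], 3), ([0.0, 1.0], 7)]
label, probs = predict_id([0.95, 0.05], history)
print(label, probs)
```

Because the prediction is read off the labels present in the history, relabeling the trajectories (for example 3 to 5 and 7 to 2) changes the predicted label accordingly while the association itself stays the same. In this sketch the label values carry no meaning beyond consistency within the context.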

Paper Structure

This paper contains 30 sections, 3 equations, 10 figures, and 9 tables.

Figures (10)

  • Figure 1: Diagram of the in-context ID prediction process. Different colored bounding boxes represent targets corresponding to different trajectories. We provide two valid ID prediction results, shown in the two lines below. This indicates that each trajectory only needs to predict the corresponding label based on the historical ID information, rather than being assigned a fixed label.
  • Figure 2: Overview of MOTIP. There are three primary components: a DETR detector detects objects, a learnable ID dictionary represents different identities, and an ID Decoder predicts the ID labels of current objects, as detailed in the MOTIP architecture section. We combine object features with their corresponding ID embeddings to form the historical trajectories $\mathcal{T}_{t-T:t-1}$. Subsequently, the ID tokens are regarded as identity prompts, and the ID Decoder performs in-context ID prediction based on them, as discussed in the in-context ID prediction and MOTIP architecture sections.
  • Figure 3: Illustration of trajectory augmentation: trajectory random occlusion (left) and trajectory random switch (right). Two different colors represent two distinct trajectories.
  • Figure 4: Illustration of the parallelized training of MOTIP, using a five-frame demo. Since the detection process for each frame is independent, all DETRs in a sequence can run their forward passes simultaneously, which is GPU-friendly. In our implementation, we divide all DETRs into two forward passes (marked $1$ and $2$) since we only backpropagate gradients for a subset of them, as described in the implementation details section.
  • Figure 5: Python-like pseudocode for the core of our ID assignment process.
  • ...and 5 more figures
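
The two trajectory augmentations named in the Figure 3 caption can be sketched as follows. This is an illustrative sketch under stated assumptions, not the authors' implementation: trajectories are assumed to be stored as a dict mapping a track ID to a per-frame list of observations (None meaning not visible), and the function names random_occlusion and random_switch as well as the probabilities p_occ and p_switch are hypothetical.

```python
import random


def random_occlusion(trajectories, p_occ, rng):
    """Hide each visible observation independently with probability p_occ.

    Returns a new dict; the input is not modified.
    """
    return {
        tid: [None if (obs is not None and rng.random() < p_occ) else obs
              for obs in frames]
        for tid, frames in trajectories.items()
    }


def random_switch(trajectories, p_switch, rng):
    """With probability p_switch per frame, swap the observations of two
    trajectories that are both visible in that frame.

    This imitates an ID switch in the history. All trajectories are assumed
    to cover the same number of frames. Returns a new dict.
    """
    out = {tid: list(frames) for tid, frames in trajectories.items()}
    num_frames = len(next(iter(out.values()))) if out else 0
    for t in range(num_frames):
        visible = [tid for tid in out if out[tid][t] is not None]
        if len(visible) >= 2 and rng.random() < p_switch:
            a, b = rng.sample(visible, 2)
            out[a][t], out[b][t] = out[b][t], out[a][t]
    return out


# Strings stand in for per-frame object features.
rng = random.Random(0)
tracks = {"a": ["a0", "a1", "a2"], "b": ["b0", "b1", "b2"]}
augmented = random_switch(random_occlusion(tracks, 0.2, rng), 0.2, rng)
print(augmented)
```

Training on histories corrupted this way exposes the model to missing observations and mislabeled tokens, which is the kind of robustness the TL;DR attributes to trajectory augmentation.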