Contrastive Learning for Multi-Object Tracking with Transformers

Pierre-François De Plaen; Nicola Marinello; Marc Proesmans; Tinne Tuytelaars; Luc Van Gool

TL;DR

This work presents ContrasTR, a DETR-based approach to multi-object tracking that learns identity-level representations through an instance-level contrastive loss and a revised sampling strategy. By casting MOT as a multi-task problem and using a lightweight, memory-guided online association, the method preserves detection quality while enabling re-identification without heavy architectural additions. A pre-training scheme on object detection data further improves the embedding space and, in turn, tracking performance. Empirically, ContrasTR surpasses the previous state of the art by +2.6 mMOTA on BDD100K and remains competitive with Transformer-based trackers on MOT17, while adding little overhead.

Abstract

The DEtection TRansformer (DETR) opened new possibilities for object detection by modeling it as a translation task: converting image features into object-level representations. Previous works typically add expensive modules to DETR to perform Multi-Object Tracking (MOT), resulting in more complicated architectures. We instead show how DETR can be turned into a MOT model by employing an instance-level contrastive loss, a revised sampling strategy and a lightweight assignment method. Our training scheme learns object appearances while preserving detection capabilities and with little overhead. Its performance surpasses the previous state-of-the-art by +2.6 mMOTA on the challenging BDD100K dataset and is comparable to existing transformer-based methods on the MOT17 dataset.
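To make the idea of an instance-level contrastive loss concrete, the following is a minimal NumPy sketch of a generic supervised InfoNCE-style loss over object embeddings. It is an illustration under stated assumptions, not the paper's exact formulation: the function name `instance_contrastive_loss`, the temperature value, and the choice to average over all positives of each anchor are all hypothetical.

```python
import numpy as np


def instance_contrastive_loss(embeddings, ids, temperature=0.1):
    """Generic supervised contrastive loss (illustrative, not the paper's exact loss).

    embeddings: (N, D) array, one row per object embedding.
    ids:        (N,) integer identity labels; equal labels are positives.
    """
    z = np.asarray(embeddings, dtype=np.float64)
    ids = np.asarray(ids)
    n = len(ids)

    not_self = ~np.eye(n, dtype=bool)
    # Positives: other embeddings that share the anchor's identity.
    positives = (ids[:, None] == ids[None, :]) & not_self
    pos_count = positives.sum(axis=1)
    has_pos = pos_count > 0
    if not has_pos.any():
        return 0.0  # no positive pair available, nothing to contrast

    # L2-normalise so that dot products are cosine similarities.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    logits = (z @ z.T) / temperature
    # Exclude each anchor from its own softmax denominator.
    logits = np.where(not_self, logits, -np.inf)

    # Numerically stable log-softmax over the remaining embeddings.
    row_max = logits.max(axis=1, keepdims=True)
    shifted = logits - row_max
    log_prob = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))

    # Mean log-probability of the positives, averaged over valid anchors.
    pos_log_prob = np.where(positives, log_prob, 0.0).sum(axis=1)
    return float(-(pos_log_prob[has_pos] / pos_count[has_pos]).mean())


# Embeddings that cluster by identity should yield a lower loss.
emb = [[1.0, 0.0], [1.0, 0.01], [0.0, 1.0], [0.01, 1.0]]
loss_clustered = instance_contrastive_loss(emb, [0, 0, 1, 1])
loss_mixed = instance_contrastive_loss(emb, [0, 1, 0, 1])
```

In this toy setup, labelling the two nearly parallel vectors as the same identity gives a much smaller loss than labelling them as different identities, which is the behaviour a contrastive objective is meant to reward.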

Paper Structure (22 sections, 4 equations, 6 figures, 13 tables)

Introduction
Related Work
Learning Identity-Level Representations
Preliminaries
MOT as a multi-task learning problem
Sampling strategy
Learning MOT from Object Detection datasets
Object association with maximal similarity
Experiments
Experimental setup
Implementation details
Results
Ablation study
Conclusion
Acknowledgements
...and 7 more sections

Figures (6)

Figure 1: t-SNE projection of the predicted embeddings for the first 40 ground-truth objects in video b23f7012-fab06dac of the BDD100K validation set. Each color-symbol pair represents a ground-truth tracking ID assigned with DETR's bipartite matching. Even during nighttime, the method can discriminate similar objects.
Figure 2: t-SNE visualization of the tracking embeddings of video 4 of MOT17. Each color-symbol pair represents a unique tracking ID, assigned with DETR's bipartite matching. All models are pre-trained on the CrowdHuman dataset [shao2018crowdhuman] and evaluated on the validation set of MOT17 [MOT16]. Without the contrastive loss, Deformable-DETR's embeddings are not clustered per instance ID.
Figure 3: ContrasTR. Our framework during the inference phase. ID assignment maximizes the global cosine similarity between the predictions and previous instances. A new instance (N) entry is added to the cost matrix for each prediction.
Figure 4: Predictions of our model on the validation set of BDD100K. Each color represents a different predicted ID. The method is robust to occlusions and to nighttime conditions.
Figure 5: Predictions and failure cases of our model on the validation set of BDD100K.
...and 1 more figure
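The Figure 3 caption describes ID assignment as maximizing the global cosine similarity between predictions and previous instances, with one new-instance entry added per prediction. A minimal sketch of that idea, assuming SciPy's Hungarian solver and a fixed score for the new-instance entries, could look as follows. The function name `associate`, the `new_score` value, and the way fresh IDs are numbered are hypothetical and are not taken from the paper.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def associate(pred_emb, track_emb, track_ids, new_score=0.5, next_id=0):
    """Assign a track ID to each prediction by maximizing total similarity.

    pred_emb:  (P, D) embeddings of the current predictions.
    track_emb: (T, D) embeddings of previously seen instances (may be empty).
    track_ids: length-T list of IDs matching the rows of track_emb.
    new_score: assumed score of starting a new track (hypothetical value).
    """
    if len(pred_emb) == 0:
        return [], next_id

    pred = np.asarray(pred_emb, dtype=np.float64)
    tracks = np.asarray(track_emb, dtype=np.float64).reshape(-1, pred.shape[1])
    num_pred, num_tracks = len(pred), len(tracks)

    # Cosine similarity between every prediction and every stored instance.
    pred = pred / np.linalg.norm(pred, axis=1, keepdims=True)
    tracks = tracks / np.linalg.norm(tracks, axis=1, keepdims=True)
    similarity = pred @ tracks.T

    # One new-instance column per prediction, so every prediction can
    # always open a new track instead of taking a poor match.
    new_block = np.full((num_pred, num_pred), new_score)
    scores = np.hstack([similarity, new_block])

    rows, cols = linear_sum_assignment(scores, maximize=True)

    assigned = [None] * num_pred
    for r, c in zip(rows, cols):
        if c < num_tracks:
            assigned[r] = track_ids[c]
        else:
            assigned[r] = next_id  # unmatched prediction starts a new track
            next_id += 1
    return assigned, next_id


# Two predictions resemble stored instances 7 and 9; the third matches neither.
ids, next_free = associate(
    pred_emb=[[0.9, 0.1], [0.1, 0.9], [-1.0, -1.0]],
    track_emb=[[1.0, 0.0], [0.0, 1.0]],
    track_ids=[7, 9],
    next_id=10,
)
```

Because the assignment is solved globally, a prediction only takes over an existing ID when doing so increases the total similarity, and a prediction whose best similarity falls below `new_score` opens a new track instead.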
