Contrastive Learning for Multi-Object Tracking with Transformers
Pierre-François De Plaen, Nicola Marinello, Marc Proesmans, Tinne Tuytelaars, Luc Van Gool
TL;DR
This work presents ContrasTR, a DETR-based approach to multi-object tracking that learns robust identity-level representations through an instance-level contrastive loss and a carefully designed sampling strategy. By casting MOT as a multi-task problem and employing a lightweight, memory-guided online association, the method preserves detection quality while enabling reliable re-identification without heavy architectural additions. A scalable pre-training scheme on detection data further enhances the embedding space, improving tracking performance. Empirically, ContrasTR achieves a new state-of-the-art mMOTA on BDD100K and remains competitive with Transformer-based trackers on MOT17, while maintaining efficiency and scalability for large-scale datasets.
Abstract
The DEtection TRansformer (DETR) opened new possibilities for object detection by modeling it as a translation task: converting image features into object-level representations. Previous works typically add expensive modules to DETR to perform Multi-Object Tracking (MOT), resulting in more complicated architectures. We instead show how DETR can be turned into an MOT model by employing an instance-level contrastive loss, a revised sampling strategy, and a lightweight assignment method. Our training scheme learns object appearances while preserving detection capabilities, and it adds little overhead. The resulting model surpasses the previous state-of-the-art by +2.6 mMOTA on the challenging BDD100K dataset and is comparable to existing transformer-based methods on the MOT17 dataset.
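
To make the idea of an instance-level contrastive loss more concrete, the following is a minimal sketch, assuming a generic supervised-contrastive (InfoNCE-style) formulation over object embeddings. It is not the authors' implementation: the function names, the temperature value, and the pure-Python style are illustrative assumptions, and the paper's sampling strategy is not reproduced here. Embeddings of objects that share a track identity are pulled together, and all other objects in the batch act as negatives.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def instance_contrastive_loss(embeddings, track_ids, temperature=0.1):
    """Hypothetical supervised-contrastive loss over object embeddings.

    embeddings:  list of feature vectors, one per object
    track_ids:   list of identity labels, same length as embeddings
    temperature: softmax temperature (0.1 is an assumed value)
    """
    z = [l2_normalize(e) for e in embeddings]
    n = len(z)
    total, num_anchors = 0.0, 0
    for i in range(n):
        # Positives: other objects carrying the same identity as anchor i.
        positives = [j for j in range(n)
                     if j != i and track_ids[j] == track_ids[i]]
        if not positives:
            continue  # an anchor without positives contributes nothing
        # Temperature-scaled cosine similarity to every other object.
        logits = {j: sum(a * b for a, b in zip(z[i], z[j])) / temperature
                  for j in range(n) if j != i}
        # Numerically stable log-sum-exp for the softmax denominator.
        m = max(logits.values())
        log_denom = m + math.log(sum(math.exp(v - m) for v in logits.values()))
        # Average negative log-probability over the anchor's positives.
        total += -sum(logits[j] - log_denom for j in positives) / len(positives)
        num_anchors += 1
    return total / num_anchors if num_anchors else 0.0


# Toy example: two identities, two objects each (e.g. from two frames).
embeddings = [[1.0, 0.0], [1.0, 0.05], [0.0, 1.0], [0.05, 1.0]]
loss_consistent = instance_contrastive_loss(embeddings, [1, 1, 2, 2])
loss_mismatched = instance_contrastive_loss(embeddings, [1, 2, 1, 2])
```

In this toy example the loss is small when the identity labels agree with the geometry of the embedding space and large when they do not, which is the property that allows appearance embeddings to be used for association at inference time.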
