From Detection to Association: Learning Discriminative Object Embeddings for Multi-Object Tracking

Yuqing Shao; Yuchen Yang; Rui Yu; Weilong Li; Xu Guo; Huaicheng Yan; Wei Wang; Xiao Sun

TL;DR

This work identifies that DETR-based end-to-end MOT methods produce object embeddings with high inter-object similarity, which hinders association. It introduces FDTA, a discriminative embedding refinement framework with three components: a Spatial Adapter (depth-aware spatial cues), a Temporal Adapter (trajectory-aware temporal modeling), and an Identity Adapter (quality-aware contrastive learning), which together tighten identity grouping across frames. In extensive experiments on DanceTrack, SportsMOT, and BFT, FDTA achieves state-of-the-art HOTA and IDF1 while maintaining efficient inference. The results demonstrate the practical value of explicitly optimizing object embeddings for association in end-to-end MOT, with potential for future integration of foundation-model-derived supervision and robust corner-case synthesis.

Abstract

End-to-end multi-object tracking (MOT) methods have recently achieved remarkable progress by unifying detection and association within a single framework. Despite their strong detection performance, these methods suffer from relatively low association accuracy. Through detailed analysis, we observe that object embeddings produced by the shared DETR architecture display excessively high inter-object similarity, as it emphasizes only category-level discrimination within single frames. In contrast, tracking requires instance-level distinction across frames with spatial and temporal continuity, for which current end-to-end approaches insufficiently optimize object embeddings. To address this, we introduce FDTA (From Detection to Association), an explicit feature refinement framework that enhances object discriminativeness across three complementary perspectives. Specifically, we introduce a Spatial Adapter (SA) to integrate depth-aware cues for spatial continuity, a Temporal Adapter (TA) to aggregate historical information for temporal dependencies, and an Identity Adapter (IA) to leverage quality-aware contrastive learning for instance-level separability. Extensive experiments demonstrate that FDTA achieves state-of-the-art performance on multiple challenging MOT benchmarks, including DanceTrack, SportsMOT, and BFT, highlighting the effectiveness of our proposed discriminative embedding enhancement strategy. The code is available at https://github.com/Spongebobbbbbbbb/FDTA.
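
To make the notion of "inter-object similarity" concrete, the following is a minimal sketch of one way such a diagnostic could be computed: the mean pairwise cosine similarity among the object embeddings of a single frame. The function names and the toy vectors are illustrative assumptions, not taken from the FDTA code, and the paper's own analysis may use a different measure.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two non-zero vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def mean_inter_object_similarity(embeddings):
    """Average cosine similarity over all distinct pairs of embeddings.

    `embeddings` is a list of per-object vectors from one frame
    (hypothetical input format). Values near 1 mean the objects are
    hard to tell apart in embedding space.
    """
    pairs = list(combinations(embeddings, 2))
    if not pairs:
        raise ValueError("need at least two embeddings")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


# Toy data (made up for illustration): three objects whose embeddings
# nearly coincide, versus three that point in clearly different directions.
crowded = [[1.0, 0.1], [1.0, 0.2], [1.0, 0.0]]
separated = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]

print(mean_inter_object_similarity(crowded))
print(mean_inter_object_similarity(separated))
```

Under this measure, the first set scores close to 1 and the second scores well below it. A lower mean corresponds to the kind of instance-level separability that association relies on.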
