From Detection to Association: Learning Discriminative Object Embeddings for Multi-Object Tracking

Yuqing Shao, Yuchen Yang, Rui Yu, Weilong Li, Xu Guo, Huaicheng Yan, Wei Wang, Xiao Sun

TL;DR

This work shows that DETR-based end-to-end MOT methods produce object embeddings with high inter-object similarity, which hinders association. It introduces FDTA, a discriminative embedding refinement framework built from three adapters: a Spatial Adapter (depth-aware spatial cues), a Temporal Adapter (trajectory-aware temporal modeling), and an Identity Adapter (quality-aware contrastive learning), which together tighten identity grouping across frames. In experiments on DanceTrack, SportsMOT, and BFT, FDTA achieves state-of-the-art HOTA and IDF1 while maintaining efficient inference. The results demonstrate the practical value of explicitly optimizing object embeddings for association in end-to-end MOT, and leave room for future integration of foundation-model-derived supervision and robust corner-case synthesis.

Abstract

End-to-end multi-object tracking (MOT) methods have recently achieved remarkable progress by unifying detection and association within a single framework. Despite their strong detection performance, these methods suffer from relatively low association accuracy. Through detailed analysis, we observe that object embeddings produced by the shared DETR architecture display excessively high inter-object similarity, as it emphasizes only category-level discrimination within single frames. In contrast, tracking requires instance-level distinction across frames with spatial and temporal continuity, for which current end-to-end approaches insufficiently optimize object embeddings. To address this, we introduce FDTA (From Detection to Association), an explicit feature refinement framework that enhances object discriminativeness across three complementary perspectives. Specifically, we introduce a Spatial Adapter (SA) to integrate depth-aware cues for spatial continuity, a Temporal Adapter (TA) to aggregate historical information for temporal dependencies, and an Identity Adapter (IA) to leverage quality-aware contrastive learning for instance-level separability. Extensive experiments demonstrate that FDTA achieves state-of-the-art performance on multiple challenging MOT benchmarks, including DanceTrack, SportsMOT, and BFT, highlighting the effectiveness of our proposed discriminative embedding enhancement strategy. The code is available at https://github.com/Spongebobbbbbbbb/FDTA.

Paper Structure

This paper contains 41 sections, 16 equations, 13 figures, 11 tables.

Figures (13)

  • Figure 1: Embedding similarity analysis on DanceTrack. For each frame, we compute pairwise similarities between objects and select the top-3 highest values among all pairs, then construct their distribution across all frames. FDTA produces object embeddings with significantly lower similarity compared to existing methods.
  • Figure 2: Illustration of the requirements of detection and association. They differ across spatial, temporal, and identity perspectives.
  • Figure 3: Overview of the FDTA framework. DETR produces object embeddings from input frames. The object embeddings are then refined by three explicit adapters for discriminativeness: Spatial Adapter (SA) integrates 3D geometric cues via depth learning; Temporal Adapter (TA) captures temporal dependencies via trajectory modeling; Identity Adapter (IA) promotes instance-level identification via contrastive learning. Finally, an ID Prediction module performs the object association based on the enhanced embeddings.
  • Figure 4: The detailed architecture of Spatial Adapter.
  • Figure 5: Visualization of predicted depth maps on DanceTrack and SportsMOT.
  • ...and 8 more figures
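
The similarity analysis described in the Figure 1 caption (per-frame pairwise similarities, top-3 values per frame, pooled across frames) can be sketched roughly as follows. This is only an illustrative sketch, not the authors' analysis code: the caption does not name the similarity measure, so cosine similarity is an assumption here, and the function names and toy embeddings are hypothetical.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length, non-zero vectors.

    Assumption: the caption does not state which similarity is used.
    """
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def top_k_pair_similarities(frame_embeddings, k=3):
    """Return the k highest pairwise similarities among objects in one frame."""
    sims = [
        cosine_similarity(a, b)
        for a, b in combinations(frame_embeddings, 2)
    ]
    # Frames with fewer than k object pairs contribute fewer values.
    return sorted(sims, reverse=True)[:k]


def similarity_distribution(frames, k=3):
    """Pool the per-frame top-k similarity values across all frames."""
    pooled = []
    for frame_embeddings in frames:
        pooled.extend(top_k_pair_similarities(frame_embeddings, k))
    return pooled


# Hypothetical toy data: two frames, each a list of per-object embeddings.
frames = [
    [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]],
    [[1.0, 0.0], [-1.0, 0.0]],
]
values = similarity_distribution(frames)
```

The pooled `values` list is what one would plot as a histogram to compare methods; a distribution shifted toward lower similarity indicates embeddings that are easier to tell apart within a frame.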