Temporal Correlation Meets Embedding: Towards a 2nd Generation of JDE-based Real-Time Multi-Object Tracking

Yunfei Zhang, Chao Liang, Jin Gao, Zhipeng Zhang, Weiming Hu, Stephen Maybank, Xue Zhou, Liang Li

TL;DR

The paper addresses two bottlenecks of JDE-based MOT: appearance features that lack discriminability, and conflicts between the detector and the feature extractor that hinder performance. It introduces TCBTrack, a lightweight tracker that injects cross-correlation-based temporal cues into the JDE framework, learning motion-aware heatmaps via a transductive module and a simplified embedding head. A cross-frame training scheme with Gaussian heatmaps and a Logistic-MSE loss, combined with a product-based association score $\text{Score} = \text{IoU} \cdot \text{Det\_score} \cdot \text{Temp\_score}$, where $\text{Temp\_score} = \text{Cosine}(F^t_1, F^t_2)$, yields robust online tracking without heavy ReID modules. On MOT17, MOT20, and DanceTrack, TCBTrack achieves state-of-the-art or competitive results, notably $\text{HOTA}=56.8$, $\text{IDF1}=58.1$, and $\text{MOTA}=92.5$ on DanceTrack while maintaining real-time performance, indicating that temporal correlations can improve both the robustness and the speed of MOT.
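The product-based association score can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the `(x1, y1, x2, y2)` box format, and the choice of which two feature vectors play the roles of $F^t_1$ and $F^t_2$ are assumptions made for illustration only.

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection over union of two boxes in (x1, y1, x2, y2) format (assumed)."""
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def cosine(f1, f2, eps=1e-12):
    """Cosine similarity between two feature vectors."""
    f1 = np.asarray(f1, dtype=float)
    f2 = np.asarray(f2, dtype=float)
    return float(f1 @ f2 / (np.linalg.norm(f1) * np.linalg.norm(f2) + eps))

def association_score(track_box, det_box, det_score, track_feat, det_feat):
    """Score = IoU * Det_score * Temp_score, with Temp_score = cosine similarity."""
    temp_score = cosine(track_feat, det_feat)
    return iou(track_box, det_box) * det_score * temp_score
```

Because the three cues are multiplied rather than summed, a candidate pair needs to do reasonably well on all of them: a near-zero value in any single factor drives the whole score toward zero.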

Abstract

Joint Detection and Embedding (JDE) trackers have demonstrated excellent performance in Multi-Object Tracking (MOT) tasks by incorporating the extraction of appearance features as an auxiliary task, embedding the Re-Identification (ReID) task into the detector and thereby achieving a balance between inference speed and tracking performance. However, resolving the competition between the detector and the feature extractor has always been a challenge. Meanwhile, the issue of directly embedding the ReID task into MOT has remained unresolved: the lack of high discriminability in appearance features limits their utility. In this paper, a new learning approach using cross-correlation to capture temporal information of objects is proposed. The feature extraction network is no longer trained solely on appearance features from each frame but learns richer motion features by utilizing feature heatmaps from consecutive frames, which addresses the challenge of inter-class feature similarity. Furthermore, our learning approach is applied to a more lightweight feature extraction network, and we treat the feature matching scores as strong cues rather than auxiliary cues, with an appropriate weight calculation to reflect the compatibility between our obtained features and the MOT task. Our tracker, named TCBTrack, achieves state-of-the-art performance on multiple public benchmarks, i.e., the MOT17, MOT20, and DanceTrack datasets. Specifically, on the DanceTrack test set, we achieve 56.8 HOTA, 58.1 IDF1 and 92.5 MOTA, making it the best online tracker capable of achieving real-time performance. Comparative evaluations with other trackers show that our tracker achieves the best balance between speed, robustness and accuracy. Code is available at https://github.com/yfzhang1214/TCBTrack.
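To make the cross-correlation idea concrete, the sketch below correlates one object's embedding from the previous frame with every spatial location of the current frame's feature map, producing a response heatmap. It is a simplified illustration under stated assumptions, not the paper's module: the function name, the `(C,)` and `(C, H, W)` shapes, and the use of a plain channel-wise dot product (a 1x1 correlation) are all hypothetical choices.

```python
import numpy as np

def correlation_heatmap(template, feature_map):
    """Correlate a (C,) template embedding with a (C, H, W) feature map.

    Returns an (H, W) response map whose value at each location is the
    dot product between the template and the feature vector at that location.
    """
    template = np.asarray(template, dtype=float)
    feature_map = np.asarray(feature_map, dtype=float)
    return np.einsum("c,chw->hw", template, feature_map)

# Toy usage: the current frame contains the template's features at row 2, column 3.
template = np.array([1.0, 2.0, 3.0, 4.0])
features = np.zeros((4, 5, 6))
features[:, 2, 3] = template
heatmap = correlation_heatmap(template, features)
peak = np.unravel_index(np.argmax(heatmap), heatmap.shape)
```

In this toy case the peak of the heatmap falls on the location whose features match the template, which is the kind of temporal cue the abstract describes learning from consecutive frames.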

Paper Structure (15 sections, 14 equations, 8 figures, 9 tables, 3 algorithms)

Figures (8)

  • Figure 1: Comparison between the ReID task in MOT and the MOT task. (a) ReID task in MOT. (b) MOT task. The ReID task can still be accomplished effectively when embedded into the MOT algorithm, but during the association stage matching is not performed because of the small weights. This is caused by the training approach of the ReID model, which loses temporal information, and by the misalignment of weights due to the blurry templates in MOT.
  • Figure 2: Illustration of hard samples of JDE trackers. The dots in the dashed box represent the center of the object.
  • Figure 3: The main framework of our MOT tracker. It incorporates object feature extraction inspired by the SOT task into the JDE structure, aiming to compensate for the limitations of feature extraction.
  • Figure 4: Overview of the proposed TCBTrack. It consists of the JDE paradigm, the SOT paradigm and our training/inference stage (the blue arrows represent the process during the training stage). The Detection Backbone first generates the detection result $\mathcal{B}^t, \mathcal{C}^t, \mathcal{P}^t$ and candidate embeddings $F^{id}_t$. We use a Gaussian heatmap and Logistic-MSE loss in the training stage, and the joint scores of heatmap, IoU, and confidence are used as weights in the inference stage.
  • Figure 5: Sample heatmap results using ID loss and our Logistic-MSE loss. (a) Groundtruth heatmap. (b) Heatmap generated by ID loss. (c) Heatmap generated by Logistic-MSE loss. The area circled in red represents the highest response of the heatmap. Our method can suppress the surrounding response of the target box and alleviate the problem of incorrect matching between adjacent objects.
  • ...and 3 more figures
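
The Figure 4 caption mentions a Gaussian heatmap used as the training target. A common way to build such a target is to place a 2D Gaussian at the object's center, as sketched below. The function name, the `(x, y)` center convention, and the value of `sigma` are assumptions for illustration; the paper's exact heatmap parameters and the form of its Logistic-MSE loss are not given in this section, so the loss is not reproduced here.

```python
import numpy as np

def gaussian_heatmap(height, width, center_xy, sigma=2.0):
    """Build an (height, width) map with a 2D Gaussian peak at center_xy = (x, y).

    sigma is a hypothetical spread parameter, not a value from the paper.
    """
    cx, cy = center_xy
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

# Toy usage: an 8x10 target with the object center at x=3, y=4.
target = gaussian_heatmap(8, 10, (3, 4))
```

The target equals 1 at the object center and decays smoothly with distance, so locations near the center are penalized less than distant ones when a predicted heatmap is regressed against it.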