Rethinking the competition between detection and ReID in Multi-Object Tracking

Chao Liang, Zhipeng Zhang, Xue Zhou, Bing Li, Shuyuan Zhu, Weiming Hu

TL;DR

The paper tackles the competition between detection and ReID in one-shot multi-object tracking by introducing CSTrack, which comprises a Reciprocal Network (REN) to learn task-specific representations and a Scale-aware Attention Network (SAAN) to align multi-resolution ID embeddings. By decoupling tasks and fusing shared information through self- and cross-relational mechanisms, CSTrack achieves state-of-the-art performance on MOT16, MOT17, and MOT20 while maintaining real-time efficiency (16.4 FPS; CSTrack-S 34.6 FPS). The authors provide extensive ablations and upper-bound analyses, demonstrating substantial gains in accuracy (notably IDF1) and robust data association, especially in crowded scenes. The work offers a practical, scalable approach to improving one-shot MOT without resorting to heavier two-stage pipelines, with released code for replication and extension.

Abstract

Owing to their balance of accuracy and speed, one-shot models, which jointly learn detection and identification embeddings, have drawn great attention in multi-object tracking (MOT). However, the inherent differences and relations between detection and re-identification (ReID) are overlooked because the one-shot tracking paradigm treats them as two isolated tasks. This leads to inferior performance compared with existing two-stage methods. In this paper, we first dissect the reasoning process for these two tasks, which reveals that the competition between them inevitably harms the learning of task-dependent representations. To tackle this problem, we propose a novel reciprocal network (REN) with a self-relation and cross-relation design that impels each branch to better learn task-dependent representations. The proposed model aims to alleviate the deleterious competition between tasks while improving the cooperation between detection and ReID. Furthermore, we introduce a scale-aware attention network (SAAN) that prevents semantic-level misalignment to improve the association capability of ID embeddings. By integrating the two carefully designed networks into a one-shot online MOT system, we construct a strong MOT tracker, namely CSTrack. Our tracker achieves state-of-the-art performance on the MOT16, MOT17 and MOT20 datasets without bells and whistles. Moreover, CSTrack is efficient and runs at 16.4 FPS on a single modern GPU, and its lightweight version runs at 34.6 FPS. The complete code has been released at https://github.com/JudasDie/SOTS.

Paper Structure

This paper contains 17 sections, 11 equations, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Motivation of CSTrack. (a) visualizes the similarity maps for the detection and ReID tasks, where detection expects all pedestrians to have high response values whereas ReID tends to focus on one specific pedestrian. From this point of view, the two tasks contradict each other. (b) shows that different resolutions focus on targets of different scales in an FPN-based model, where the arrow indicates the output resolution from high to low. This arrangement helps the detector find pedestrians of different sizes, but it is not suitable for semantic matching of pedestrians of different sizes in the ReID task.
  • Figure 2: Architecture diagrams. (a) is the feature extractor, including the backbone and neck (FPN). (b) illustrates the vanilla prediction structure of JDE. (c) illustrates the proposed prediction structure of CSTrack. Unlike JDE, CSTrack introduces a reciprocal network (REN) to learn task-dependent representations and a scale-aware attention network (SAAN) to generate discriminative embeddings, which together efficiently mitigate the competition.
  • Figure 3: Diagram of reciprocal network (REN). For the original feature map $\boldsymbol{F}_i$, we construct the self-relation and cross-relation maps to impel the generation of task-dependent features $\boldsymbol{F}_i^{T_1}$ and $\boldsymbol{F}_i^{T_2}$.
  • Figure 4: Details of the scale-aware attention network (SAAN). (a) is the overall structure of the network, (b) is the diagram of the spatial attention module (SAM), and (c) is the diagram of the channel attention module (CAM).
  • Figure 5: Details of online tracking. Detected candidate boxes are linked to existing tracklets by a cascade matching design following JDE.
  • ...and 3 more figures
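
The Figure 3 caption describes REN as building self-relation and cross-relation maps from a shared feature map to produce two task-dependent feature maps. The summary above does not give the formulas, so the following is only a minimal sketch under stated assumptions: each task has a C x N descriptor matrix (here passed in directly as `m_det` and `m_reid`), relation maps are softmax-normalized inner products between channels, the two maps are blended with hypothetical mixing weights `lam_det` and `lam_reid`, and the result is added back to the shared features as a residual. All function and parameter names are illustrative and are not taken from the released code.

```python
import math


def softmax_rows(matrix):
    """Row-wise softmax of a list-of-lists matrix."""
    out = []
    for row in matrix:
        peak = max(row)  # subtract the max for numerical stability
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def transpose(a):
    """Transpose a list-of-lists matrix."""
    return [list(col) for col in zip(*a)]


def matmul(a, b):
    """Matrix product of a (n x k) and b (k x m)."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


def blend(self_map, cross_map, lam):
    """Convex combination of a self-relation and a cross-relation map."""
    return [
        [lam * s + (1.0 - lam) * c for s, c in zip(s_row, c_row)]
        for s_row, c_row in zip(self_map, cross_map)
    ]


def reweight(weights, feat):
    """Mix channels of feat with a C x C map and add a residual connection."""
    mixed = matmul(weights, feat)
    return [[f + m for f, m in zip(f_row, m_row)] for f_row, m_row in zip(feat, mixed)]


def reciprocal_features(m_det, m_reid, feat, lam_det=0.5, lam_reid=0.5):
    """Return (detection features, ReID features), each C x N.

    m_det, m_reid: assumed C x N task-specific descriptors.
    feat: C x N shared feature map, flattened over spatial positions.
    """
    # Self-relation: channel-to-channel affinity within one task (assumption).
    self_det = softmax_rows(matmul(m_det, transpose(m_det)))
    self_reid = softmax_rows(matmul(m_reid, transpose(m_reid)))
    # Cross-relation: channel affinity between the two tasks (assumption).
    cross_det = softmax_rows(matmul(m_det, transpose(m_reid)))
    cross_reid = softmax_rows(matmul(m_reid, transpose(m_det)))
    # The blending rule is an illustrative choice, not the paper's stated one.
    w_det = blend(self_det, cross_det, lam_det)
    w_reid = blend(self_reid, cross_reid, lam_reid)
    return reweight(w_det, feat), reweight(w_reid, feat)


if __name__ == "__main__":
    feat = [[1.0, 2.0, 3.0], [0.5, 0.0, -1.0]]
    m_det = [[0.2, 0.1, 0.0], [0.0, 0.3, 0.1]]
    m_reid = [[0.1, 0.0, 0.4], [0.2, 0.2, 0.0]]
    f_det, f_reid = reciprocal_features(m_det, m_reid, feat)
    print(f_det)
    print(f_reid)
```

In this sketch the two outputs diverge only because the task descriptors differ; with identical descriptors both branches receive the same features, which is one way to see why task-specific descriptors matter for decoupling detection from ReID.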