FairMOT: On the Fairness of Detection and Re-Identification in Multiple Object Tracking

Yifu Zhang, Chunyu Wang, Xinggang Wang, Wenjun Zeng, Wenyu Liu

TL;DR

FairMOT tackles the fairness conflict between object detection and re-ID in MOT by introducing an anchor-free, center-based single-shot network built on CenterNet. By identifying and mitigating three unfairness factors—anchor-based sampling, shared features, and high re-ID dimensionality—the approach achieves balanced, high-accuracy detection and tracking. It uses two homogeneous branches (detection and re-ID) trained with uncertainty-based loss balancing and a lightweight single-image pre-training strategy for data efficiency. Empirical results on MOT Challenge datasets show state-of-the-art performance and real-time inference, highlighting the practical benefits of carefully balancing multi-task learning in MOT.
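The uncertainty-based loss balancing mentioned above can be illustrated with a minimal sketch. The summary does not state the exact formula, so the form below is an assumption: it is the widely used homoscedastic-uncertainty weighting, in which each task loss is scaled by exp(-w) and the weights themselves are added as a regularizer. The function name and the parameters `w_det` and `w_id` are hypothetical and are not taken from the released code.

```python
import math


def uncertainty_weighted_loss(det_loss, reid_loss, w_det, w_id):
    """Combine a detection loss and a re-ID loss with learnable weights.

    Assumed form (not confirmed by the text above):
        0.5 * (exp(-w_det) * det_loss + exp(-w_id) * reid_loss + w_det + w_id)
    In training, w_det and w_id would be learnable parameters; here they
    are plain floats so the arithmetic is easy to follow.
    """
    weighted_det = math.exp(-w_det) * det_loss
    weighted_reid = math.exp(-w_id) * reid_loss
    # The additive w terms stop the optimizer from driving both
    # weights to infinity to zero out the losses.
    return 0.5 * (weighted_det + weighted_reid + w_det + w_id)


# With both weights at zero the result is the plain average of the losses.
balanced = uncertainty_weighted_loss(2.0, 4.0, 0.0, 0.0)
```

Under this assumed form, raising a task's weight parameter shrinks that task's contribution, so neither loss has to be hand-tuned to keep it from dominating the other.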

Abstract

Multi-object tracking (MOT) is an important problem in computer vision with a wide range of applications. Formulating MOT as multi-task learning of object detection and re-ID in a single network is appealing because it allows joint optimization of the two tasks and is computationally efficient. However, we find that the two tasks tend to compete with each other, which needs to be carefully addressed. In particular, previous works usually treat re-ID as a secondary task whose accuracy is heavily affected by the primary detection task. As a result, the network is biased toward the primary detection task, which is not fair to the re-ID task. To solve the problem, we present a simple yet effective approach termed FairMOT, based on the anchor-free object detection architecture CenterNet. Note that it is not a naive combination of CenterNet and re-ID. Instead, we present a set of detailed designs that thorough empirical studies show are critical to achieving good tracking results. The resulting approach achieves high accuracy for both detection and tracking, and outperforms the state-of-the-art methods by a large margin on several public datasets. The source code and pre-trained models are released at https://github.com/ifzhang/FairMOT.

Paper Structure

This paper contains 49 sections, 5 equations, 5 figures, 11 tables.

Figures (5)

  • Figure 1: Overview of our one-shot tracker FairMOT. The input image is first fed to an encoder-decoder network to extract high resolution feature maps (stride=$4$). Then we add two homogeneous branches for detecting objects and extracting re-ID features, respectively. The features at the predicted object centers are used for tracking.
  • Figure 2: (a) Track R-CNN treats detection as the primary task and re-ID as the secondary one. Both Track R-CNN and JDE are anchor-based. The red boxes represent positive anchors and the green boxes represent the target objects. The three methods extract re-ID features differently: Track R-CNN extracts re-ID features for all positive anchors using ROI-Align, JDE extracts re-ID features at the centers of all positive anchors, and FairMOT extracts re-ID features at the object center. (b) The red anchor contains two different instances, so it is forced to predict two conflicting classes. (c) Three different anchors covering different image patches are responsible for predicting the same identity. (d) FairMOT extracts re-ID features only at the object center and can mitigate the problems in (b) and (c).
  • Figure 3: Visualization of the discriminative ability of the re-ID features. Query instances are marked with red boxes and target instances with green boxes. The similarity maps are computed using re-ID features extracted with different strategies (e.g., Center, Center-BI, ROI-Align and POS-Anchor, as described in Section \ref{sec:anchor}) and different backbones (e.g., ResNet-34 and DLA-34). The query frames and target frames are randomly chosen from the MOT17-09 and MOT17-02 sequences.
  • Figure 4: Time spent on different parts of our whole MOT system. We run tracking on sequences of different densities from the MOT17 and MOT20 datasets.
  • Figure 5: Example tracking results of our method on the test set of MOT17. Each row shows the results of sampled frames in chronological order of a video sequence. Bounding boxes and identities are marked in the images. Bounding boxes with different colors represent different identities. Best viewed in color.
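
The center-based extraction described in Figures 1 and 2 (re-ID features read only at predicted object centers on a stride-4 feature map) can be sketched as follows. This is an illustrative sketch in plain Python, not the released implementation: the function name, the 0.4 score threshold, and the 3x3 local-maximum test used in place of a proper non-maximum suppression step are all assumptions.

```python
def extract_center_embeddings(heatmap, embeddings, threshold=0.4, stride=4):
    """Return one (x, y, score, embedding) tuple per heatmap peak.

    heatmap:    H x W nested list of center scores in [0, 1].
    embeddings: H x W nested list holding one re-ID vector per location.
    A location counts as a peak if its score reaches `threshold`
    (assumed value) and is the maximum of its 3x3 neighbourhood.
    Coordinates are scaled by `stride` to map back to input pixels.
    """
    height, width = len(heatmap), len(heatmap[0])
    results = []
    for y in range(height):
        for x in range(width):
            score = heatmap[y][x]
            if score < threshold:
                continue
            # Collect the 3x3 neighbourhood, clipped at the borders.
            neighbours = [
                heatmap[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
            ]
            if score < max(neighbours):
                continue
            # Only the embedding at the peak location is kept.
            results.append((x * stride, y * stride, score, embeddings[y][x]))
    return results


# Toy 4x4 heatmap with two well-separated peaks and one suppressed neighbour.
toy_heatmap = [
    [0.1, 0.2, 0.1, 0.0],
    [0.2, 0.9, 0.3, 0.0],
    [0.1, 0.3, 0.2, 0.7],
    [0.0, 0.0, 0.1, 0.5],
]
# Each toy embedding simply records its own grid position.
toy_embeddings = [[[y, x] for x in range(4)] for y in range(4)]
peaks = extract_center_embeddings(toy_heatmap, toy_embeddings)
```

Because each object contributes a single embedding taken at its center, no location is asked to represent several identities at once, which is the ambiguity illustrated in panels (b) and (c) of Figure 2.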