Joint Counting, Detection and Re-Identification for Multi-Object Tracking

Weihong Ren, Denglu Wu, Hui Cao, Xi'ai Chen, Zhi Han, Honghai Liu

TL;DR

CountingMOT introduces a multi-task framework that jointly learns crowd counting (via density maps), object detection, and reID for multi-object tracking in crowded scenes. It enforces mutual constraints between the detection outputs and the crowd density map, enabling recovery of missed detections and rejection of false positives while maintaining online real-time performance. The method achieves state-of-the-art MOTA on MOT16 and MOT17 and strong results on MOT20, demonstrating that density-based counting can guide localization and data association in challenging crowds. Overall, the work bridges counting, detection, and reID in a unified trainable model, improving tracking reliability in dense environments and suggesting avenues for further work on reID and long-term association.

Abstract

The recent trend in 2D multiple object tracking (MOT) is jointly solving detection and tracking, where object detection and appearance (or motion) features are learned simultaneously. Despite competitive performance, in crowded scenes, joint detection and tracking methods usually fail to find accurate object associations due to missed or false detections. In this paper, we jointly model counting, detection and re-identification in an end-to-end framework, named CountingMOT, tailored for crowded scenes. By imposing mutual object-count constraints between detection and counting, CountingMOT tries to find a balance between object detection and crowd density map estimation, which can help it to recover missed detections or reject false detections. Our approach is an attempt to bridge the gap between object detection, counting, and re-identification. This is in contrast to prior MOT methods that either ignore the crowd density and thus are prone to failure in crowded scenes, or depend on local correlations to build a graphical relationship for matching targets. The proposed MOT tracker can perform online and real-time tracking, and achieves state-of-the-art results on public benchmarks MOT16 (MOTA of 79.7%), MOT17 (MOTA of 81.3%) and MOT20 (MOTA of 78.9%).
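
The idea of an object-count constraint can be illustrated with a small sketch. It is not the paper's formulation: the abstract describes mutual constraints between detection and counting inside the model, whereas the snippet below applies the count after the fact, assuming that the sum of a density map estimates the number of people and that this count decides how many scored detections to keep. The helper names (`density_count`, `select_by_count`) and the toy values are hypothetical.

```python
def density_count(density_map):
    """Estimated object count: the sum of all density-map values."""
    return sum(sum(row) for row in density_map)


def select_by_count(detections, density_map):
    """Keep the top-k detections by score, where k is the rounded density count.

    Illustrative post-hoc filter only (an assumption, not the authors' method).
    """
    k = max(0, round(density_count(density_map)))
    ranked = sorted(detections, key=lambda d: d["score"], reverse=True)
    return ranked[:k]


# Toy density map whose values sum to 3.0, i.e. about three people.
density = [
    [0.0, 0.5, 0.5],
    [0.25, 0.25, 0.0],
    [0.5, 0.5, 0.5],
]

# Toy detections: (x1, y1, x2, y2) boxes with confidence scores.
detections = [
    {"box": (10, 10, 30, 60), "score": 0.92},
    {"box": (40, 12, 58, 64), "score": 0.35},  # a fixed 0.4 threshold would drop this
    {"box": (70, 15, 90, 66), "score": 0.81},
    {"box": (5, 80, 20, 120), "score": 0.12},  # likely a false detection
]

kept = select_by_count(detections, density)
print([d["score"] for d in kept])
```

In this toy case the count of three keeps the low-confidence 0.35 box, which a fixed threshold of 0.4 would have missed, and drops the 0.12 box, mirroring the recover-or-reject behaviour described in the abstract.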

Paper Structure (25 sections, 13 equations, 6 figures, 6 tables)

Figures (6)

  • Figure 1: Object detections with a counting constraint in a crowded scene of MOT20. For clear visualization, we only show part of the object detections in the scene. Object detections in the top-right are generated by the state-of-the-art method FairMOT [zhang2021fairmot], which jointly produces object detections and reID features. However, FairMOT fails to locate occluded people in extremely crowded regions. By incorporating the crowd density map (bottom-left) as a counting constraint, our proposed CountingMOT finds missed object detections (green boxes in the bottom-right) and can also eliminate false detections (the red box in the bottom-right).
  • Figure 2: The proposed CountingMOT model for joint Counting, Detection, and re-Identification. The input image is first fed to the backbone for multi-level feature extraction. Then, we add three homogeneous branches that simultaneously perform detection, counting and reID, respectively. We also create mutual constraints between detection and counting to improve detections in crowded scenes. The reID branch is used to generate appearance features for data association.
  • Figure 3: Qualitative results on the MOT17-07 test set. As observed, all the trackers except ours lose object detections (marked with red arrows) and thus have relatively worse tracking performance. Our tracker also has a higher object count.
  • Figure 4: Qualitative results on the MOT17-20 test set. For the crowded scene, CSTrack [liang2022rethinking] and Trackformer [meinhardt2022trackformer] lose too many object detections (see the "Num" in the figure), while our CountingMOT tracker has the highest object count, which implicitly indicates that the crowd density map indeed helps to locate occluded persons (zoom in for clear visualization).
  • Figure 5: Training Curves for $w_1$, $w_2$, $w_3$.
  • ...and 1 more figures
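
The Figure 2 caption notes that the reID branch produces appearance features for data association. The excerpt does not say how the association itself is carried out, so the following is only a minimal sketch under stated assumptions: features are compared with cosine similarity, and tracks are matched to detections greedily, as a simple stand-in for whatever matching procedure the tracker actually uses. The function names, the `min_sim` threshold, and the toy feature vectors are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm > 0 else 0.0


def greedy_associate(track_feats, det_feats, min_sim=0.5):
    """Return {track_id: detection_index}, taking the most similar pairs first.

    Greedy matching is an illustrative simplification (an assumption).
    """
    pairs = sorted(
        (
            (cosine_similarity(tf, df), tid, j)
            for tid, tf in track_feats.items()
            for j, df in enumerate(det_feats)
        ),
        reverse=True,
    )
    matches, used = {}, set()
    for sim, tid, j in pairs:
        if sim < min_sim:
            break  # remaining pairs are even less similar
        if tid in matches or j in used:
            continue  # each track and each detection is matched at most once
        matches[tid] = j
        used.add(j)
    return matches


# Toy appearance features for two existing tracks and three new detections.
tracks = {1: [1.0, 0.0, 0.0], 2: [0.0, 1.0, 0.0]}
dets = [[0.1, 0.9, 0.0], [0.9, 0.1, 0.0], [0.0, 0.0, 1.0]]

matches = greedy_associate(tracks, dets)
print(matches)
```

Here track 1 is paired with detection 1 and track 2 with detection 0, while detection 2 stays unmatched and could start a new track. A detection recovered through the counting constraint would enter this step like any other, which is why better detections in crowds can translate into better associations.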