TOPIC: A Parallel Association Paradigm for Multi-Object Tracking under Complex Motions and Diverse Scenes

Xiaoyan Cao, Yiyao Zheng, Yao Yao, Huapeng Qin, Xiaoyu Cao, Shihui Guo

TL;DR

This work addresses MOT in scenarios with complex, diverse motions by introducing a parallel association paradigm (TOPIC) that jointly leverages motion and appearance cues. It also provides BEE24, a bee-focused dataset with long sequences, small objects, occlusion, and highly variable motion to challenge existing trackers. TOPIC is implemented with a two-round matching mechanism that adaptively selects between appearance- and motion-based matches based on a motion-level threshold, and is enhanced by the Attention-based Appearance Reconstruction Module (AARM) to improve identity discrimination without additional training. Across five datasets, including BEE24, TOPICTrack achieves state-of-the-art results and demonstrates significant reductions in false negatives, while ablation studies confirm the effectiveness of both TOPIC and AARM in improving robustness for complex motions and diverse scenes.

Abstract

Video data and algorithms have been driving advances in multi-object tracking (MOT). While existing MOT datasets focus on occlusion and appearance similarity, complex motion patterns are widespread yet overlooked. To address this issue, we introduce a new dataset called BEE24 to highlight complex motions. Identity association algorithms have long been the focus of MOT research. Existing trackers can be categorized into two association paradigms: single-feature paradigm (based on either motion or appearance feature) and serial paradigm (one feature serves as secondary while the other is primary). However, these paradigms are incapable of fully utilizing different features. In this paper, we propose a parallel paradigm and present the Two rOund Parallel matchIng meChanism (TOPIC) to implement it. The TOPIC leverages both motion and appearance features and can adaptively select the preferable one as the assignment metric based on motion level. Moreover, we provide an Attention-based Appearance Reconstruction Module (AARM) to reconstruct appearance feature embeddings, thus enhancing the representation of appearance features. Comprehensive experiments show that our approach achieves state-of-the-art performance on four public datasets and BEE24. Moreover, BEE24 challenges existing trackers to track multiple similar-appearing small objects with complex motions over long periods, which is critical in real-world applications such as beekeeping and drone swarm surveillance. Notably, our proposed parallel paradigm surpasses the performance of existing association paradigms by a large margin, e.g., reducing false negatives by 6% to 81% compared to the single-feature association paradigm. The introduced dataset and association paradigm in this work offer a fresh perspective for advancing the MOT field. The source code and dataset are available at https://github.com/holmescao/TOPICTrack.
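The parallel idea described above (match on motion and on appearance independently, take the union, and let motion level decide conflicts) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `greedy_match` is a simple stand-in for an optimal assignment solver, the names `parallel_associate`, `motion_level`, `level_threshold`, and `max_cost` are hypothetical, and the rule that appearance wins when a track's motion level exceeds the threshold is an assumption of this sketch. It covers only the union-and-resolve step, not the full two-round procedure.

```python
def greedy_match(cost, max_cost):
    """Greedy one-to-one matching on a cost matrix (rows: tracks, cols: detections).

    Stand-in for an optimal assignment solver; pairs above max_cost are ignored.
    """
    pairs = sorted(
        (cost[t][d], t, d)
        for t in range(len(cost))
        for d in range(len(cost[t]))
        if cost[t][d] <= max_cost
    )
    used_t, used_d, match = set(), set(), {}
    for _, t, d in pairs:
        if t not in used_t and d not in used_d:
            match[t] = d
            used_t.add(t)
            used_d.add(d)
    return match


def parallel_associate(motion_cost, appearance_cost, motion_level,
                       level_threshold, max_cost=0.7):
    """Union of motion and appearance matches with per-track conflict resolution."""
    motion = greedy_match(motion_cost, max_cost)
    appearance = greedy_match(appearance_cost, max_cost)

    candidates = []  # (priority, track, detection); lower priority is accepted first
    for t in set(motion) | set(appearance):
        m, a = motion.get(t), appearance.get(t)
        if m is not None and m == a:
            candidates.append((0, t, m))  # both cues agree
            continue
        # ASSUMPTION: above the threshold, appearance is the preferred cue.
        prefer_appearance = motion_level[t] > level_threshold
        if a is not None:
            candidates.append((1 if prefer_appearance else 2, t, a))
        if m is not None:
            candidates.append((2 if prefer_appearance else 1, t, m))

    final, used_d = {}, set()
    for _, t, d in sorted(candidates):
        if t not in final and d not in used_d:
            final[t] = d
            used_d.add(d)
    return final


# Two tracks, two detections; the two cues disagree on every pair.
motion_cost = [[0.1, 0.6], [0.5, 0.2]]
appearance_cost = [[0.6, 0.1], [0.2, 0.6]]
fast = parallel_associate(motion_cost, appearance_cost, [0.9, 0.9], 0.5)
slow = parallel_associate(motion_cost, appearance_cost, [0.1, 0.1], 0.5)
```

In this toy case the two cues propose opposite pairings, so the result depends entirely on the per-track motion level. When each cue matches a different track, the union keeps both matches, which a serial ("intersection") scheme would not.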

Paper Structure

This paper contains 24 sections, 13 equations, 12 figures, 7 tables, 1 algorithm.

Figures (12)

  • Figure 1: Comparison of the properties of different datasets. In addition to the properties of occlusion and highly similar appearance, the property of complex motion patterns is remarkable in BEE24. This can be seen in the diversity of motion patterns between objects and the variability of motion patterns of a single object. In the legend, "Complex" and "Simple" denote the objects with the most complex and simplest motion patterns in the scene, respectively.
  • Figure 2: Comparison of existing association paradigms with our proposed parallel paradigm. (a) The single-feature association paradigm uses either the motion or the appearance feature as the assignment metric; (b) the serial association paradigm manually specifies one feature to filter association candidates and then uses the other as the primary assignment metric, which resembles taking the "intersection" of motion and appearance matches; (c) our proposed parallel association paradigm uses motion and appearance features as assignment metrics in parallel, which resembles taking the "union" of the two sets of matches, and can resolve conflicts between them.
  • Figure 3: Comparison of motion pattern complexity between BEE24 and four popular datasets. (a) the diversity of motion patterns among objects; (b) the variability of motion patterns of a single object across frames. "G" in xticks stands for GMOT-40.
  • Figure 4: Comparison of the performance of existing association paradigms on different scenes. The first row shows tracking a flying bee (high-speed); the second row shows tracking an occluded bee (low-speed).
  • Figure 5: Overview of the AARM. Taking the similarity metric of the same bee as an example, we first use the appearance embeddings of the bee's history trajectory, $t_{(1,1)}$, and of the current detection, $d_1$, to compute the attention map $R_{(1,1),1}$. Next, a softmax operation on the attention map gives $R^d_{(1,1),1}$, and a transposition then gives $R^t_{(1,1),1}$, yielding two cross-attention maps. The appearance embeddings $t_{(1,1)}$ and $d_1$ are then reconstructed via the residual attention mechanism. After reconstruction, the similarity score between appearance embeddings of the same bee increases, e.g., from 0.8 to 0.9, while the score between embeddings of different bees decreases.
  • ...and 7 more figures
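
The reconstruction steps in the Figure 5 caption can be sketched in a few lines. This is a rough illustration under several assumptions, not the authors' module: each embedding is treated as a small list of token vectors (the caption does not give shapes), the attention map is a plain dot product between tokens, and the helper names `softmax_rows`, `reconstruct`, and `cosine` are hypothetical. The order of operations (softmax first, then transposition) follows the caption literally.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax_rows(m):
    """Numerically stable softmax applied to each row."""
    out = []
    for row in m:
        peak = max(row)
        exps = [math.exp(v - peak) for v in row]
        total = sum(exps)
        out.append([e / total for e in exps])
    return out


def reconstruct(track_tokens, det_tokens):
    """Residual cross-attention between a track embedding and a detection embedding."""
    attn = matmul(track_tokens, transpose(det_tokens))  # attention map R
    attn_d = softmax_rows(attn)                         # R^d: softmax of R
    attn_t = transpose(attn_d)                          # R^t: transposition of R^d
    # Residual connection: original tokens plus attended tokens from the other side.
    new_track = [[x + y for x, y in zip(r, s)]
                 for r, s in zip(track_tokens, matmul(attn_d, det_tokens))]
    new_det = [[x + y for x, y in zip(r, s)]
               for r, s in zip(det_tokens, matmul(attn_t, track_tokens))]
    return new_track, new_det


def cosine(a, b):
    """Cosine similarity between two token sets, flattened to vectors."""
    u = [v for row in a for v in row]
    w = [v for row in b for v in row]
    dot = sum(x * y for x, y in zip(u, w))
    return dot / (math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in w)))


tokens = [[1.0, 0.0], [0.0, 1.0]]
new_track, new_det = reconstruct(tokens, tokens)
score = cosine(new_track, new_det)
```

No weights are learned here, which is consistent with the TL;DR's statement that AARM works without additional training. Whether the similarity rises or falls for a given pair depends on the actual embeddings.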