FusionTrack: End-to-End Multi-Object Tracking in Arbitrary Multi-View Environment
Xiaohe Li, Pengfei Li, Zide Fan, Ying Geng, Fangli Mou, Haohua Wu, Yunping Ge
TL;DR
This work addresses the challenge of free-viewpoint multi-view MOT by introducing the MDMOT drone-based dataset and a unified Transformer-based framework, FusionTrack, that jointly optimizes single-view tracking and cross-view ReID. FusionTrack features a Tracklet Memory Pool, an Object Update Module for spatiotemporal feature rectification, and a Neighbor Filtering Mechanism with Viewpoint-guided Hierarchical Clustering to achieve robust cross-view identity matching, including an optimal-transport-based identity optimization strategy. The approach delivers state-of-the-art results on MDMOT and strong performance on standard multi-view pedestrian benchmarks, demonstrating improved robustness to view changes, occlusions, and small object sizes. These results suggest practical value for real-world drone surveillance, traffic analytics, and urban management, where flexible, scalable, and accurate multi-view tracking is essential.
Abstract
Multi-view multi-object tracking (MVMOT) has found widespread applications in intelligent transportation, surveillance systems, and urban management. However, existing studies rarely address genuinely free-viewpoint MVMOT systems, which could significantly enhance the flexibility and scalability of cooperative tracking systems. To bridge this gap, we first construct the Multi-Drone Multi-Object Tracking (MDMOT) dataset, captured by mobile drone swarms across diverse real-world scenarios, establishing the first benchmark for multi-object tracking in arbitrary multi-view environments. Building upon this foundation, we propose FusionTrack, an end-to-end framework that integrates tracking and re-identification to leverage multi-view information for robust trajectory association. Extensive experiments on our MDMOT and other benchmark datasets demonstrate that FusionTrack achieves state-of-the-art performance in both single-view and multi-view tracking.
