FusionTrack: End-to-End Multi-Object Tracking in Arbitrary Multi-View Environment

Xiaohe Li, Pengfei Li, Zide Fan, Ying Geng, Fangli Mou, Haohua Wu, Yunping Ge

TL;DR

This work addresses free-viewpoint multi-view MOT by introducing MDMOT, a drone-based dataset, and FusionTrack, a unified Transformer-based framework that jointly optimizes single-view tracking and cross-view ReID. FusionTrack combines a Tracklet Memory Pool, an Object Update Module for spatiotemporal feature rectification, and a Neighbor Filtering Mechanism with Viewpoint-guided Hierarchical Clustering for robust cross-view identity matching, together with an optimal-transport-based identity optimization strategy. The approach delivers state-of-the-art results on MDMOT and strong performance on standard multi-view pedestrian benchmarks, with improved robustness to view changes, occlusions, and small object sizes. These results suggest practical value for drone surveillance, traffic analytics, and urban management, where flexible, scalable, and accurate multi-view tracking is essential.

Abstract

Multi-view multi-object tracking (MVMOT) has found widespread applications in intelligent transportation, surveillance systems, and urban management. However, existing studies rarely address genuinely free-viewpoint MVMOT systems, which could significantly enhance the flexibility and scalability of cooperative tracking systems. To bridge this gap, we first construct the Multi-Drone Multi-Object Tracking (MDMOT) dataset, captured by mobile drone swarms across diverse real-world scenarios, establishing the first benchmark for multi-object tracking in arbitrary multi-view environments. Building upon this foundation, we propose FusionTrack, an end-to-end framework that integrates tracking and re-identification to leverage multi-view information for robust trajectory association. Extensive experiments on our MDMOT and other benchmark datasets demonstrate that FusionTrack achieves state-of-the-art performance in both single-view and multi-view tracking.

Paper Structure

This paper contains 36 sections, 15 equations, 12 figures, 5 tables.

Figures (12)

  • Figure 1: Examples of the MDMOT dataset. The left panel illustrates the spatial distribution of drones at a specific moment within the overpass scenario, along with the corresponding multi-view imagery they captured, clearly highlighting both overlapping and non-overlapping regions. The top-right quadrant presents statistical analysis of bounding box distributions and identity frequencies across six representative training scenes. The bottom-right section displays a word cloud visualizing prominent feature terms in the dataset.
  • Figure 2: Image examples under varying weather conditions. From top-left to bottom-right: sunny, cloudy, dusk, and nighttime.
  • Figure 3: (a) Heatmap of object location distribution, where the x and y axes represent normalized image coordinates, and the z-axis indicates the frequency of object occurrence. (b) Heatmap of box size distribution, with the x and y axes indicating the relative width and height of boxes.
  • Figure 4: Comparison of dataset statistics, with blue indicating the total number of bounding boxes, green representing the total number of frames, and red denoting the average number of boxes per frame.
  • Figure 5: Overview of our FusionTrack framework. The Single-view Tracking module tracks objects independently within each view. The Tracklet Memory Pool stores queries collected from multiple views and across temporal frames. The Trajectory ReID module extracts discriminative ReID features to facilitate robust cross-view association. The Object Update Module refines current-frame queries by integrating spatiotemporal context, producing enhanced representations.
  • ...and 7 more figures
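
The Figure 5 caption describes the Tracklet Memory Pool as a store of queries collected from multiple views and across temporal frames. As a rough illustration of that idea only, the sketch below keeps a bounded history of query vectors per tracklet and per view. Everything here is an assumption made for illustration: the class and method names, the `(view_id, track_id)` keying, the `max_len` window, and the plain temporal average, which stands in for whatever fusion the paper actually uses.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Tuple

Query = List[float]    # stand-in for a Transformer query embedding
Key = Tuple[int, int]  # hypothetical key: (view_id, track_id)


class TrackletMemoryPool:
    """Keeps the most recent queries of every tracklet, separately per view."""

    def __init__(self, max_len: int = 4) -> None:
        # max_len is an assumed temporal window, not a value from the paper.
        self.max_len = max_len
        self._pool: Dict[Key, Deque[Tuple[int, Query]]] = defaultdict(
            lambda: deque(maxlen=self.max_len)
        )

    def push(self, view_id: int, track_id: int, frame_idx: int, query: Query) -> None:
        """Store one query; the oldest entry is dropped once the window is full."""
        self._pool[(view_id, track_id)].append((frame_idx, list(query)))

    def history(self, view_id: int, track_id: int) -> List[Query]:
        """Return stored queries for one tracklet, oldest first."""
        key = (view_id, track_id)
        if key not in self._pool:
            return []
        return [q for _, q in self._pool[key]]

    def tracklets_in_view(self, view_id: int) -> List[int]:
        """List the track ids currently stored for a given view."""
        return sorted(t for (v, t) in self._pool if v == view_id)

    def mean_query(self, view_id: int, track_id: int) -> Query:
        """Temporal average of stored queries (a simple placeholder for fusion)."""
        hist = self.history(view_id, track_id)
        if not hist:
            raise KeyError((view_id, track_id))
        dim = len(hist[0])
        return [sum(q[i] for q in hist) / len(hist) for i in range(dim)]


if __name__ == "__main__":
    pool = TrackletMemoryPool(max_len=2)
    pool.push(view_id=0, track_id=7, frame_idx=0, query=[1.0, 0.0])
    pool.push(view_id=0, track_id=7, frame_idx=1, query=[3.0, 2.0])
    pool.push(view_id=0, track_id=7, frame_idx=2, query=[5.0, 4.0])
    print(pool.history(0, 7))
    print(pool.mean_query(0, 7))
```

A bounded `deque` keeps memory use constant per tracklet, which is one simple way to limit how far back temporal context reaches; the paper may handle this differently.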