GMT: Effective Global Framework for Multi-Camera Multi-Target Tracking

Yihao Zhen; Mingyue Xu; Qiang Wang; Baojie Fan; Jiahua Dong; Tinghui Zhao; Huijie Fan

TL;DR

This work tackles MCMT tracking by addressing the underutilization of multi-view information in two-stage frameworks. It introduces GMT, a global framework that encodes the same targets across views into global trajectories and performs direct trajectory–target association via the Global Trajectory Association (GTA) module, aided by the Cross-View Feature Consistency Enhancement (CFCE) module, which aligns features across views. The authors also present VisionTrack, a large-scale, diverse MCMT dataset collected with moving UAVs to better reflect real-world complexity. Empirically, GMT achieves substantial improvements over state-of-the-art trackers on VisionTrack and other benchmarks, particularly in cross-view matching and identity preservation, while maintaining efficient training and inference. Together, GMT and VisionTrack advance the practical deployment of robust multi-view MCMT tracking in complex environments.

Abstract

Multi-Camera Multi-Target (MCMT) tracking aims to locate and associate the same targets across multiple camera views. Existing methods typically adopt a two-stage framework, involving single-camera tracking followed by inter-camera tracking. However, in this paradigm, multi-view information is used only to recover missed matches in the first stage, providing a limited contribution to overall tracking. To address this issue, we propose GMT, a global MCMT tracking framework that jointly exploits intra-view and inter-view cues for tracking. Specifically, instead of assigning trajectories independently for each view, we integrate the same historical targets across different views as global trajectories, thereby reformulating the two-stage tracking as a unified global-level trajectory-target association process. We introduce a Cross-View Feature Consistency Enhancement (CFCE) module to align visual and spatial features across views, providing a consistent feature space for global trajectory modeling. With these aligned features, the Global Trajectory Association (GTA) module associates new detections with existing global trajectories, enabling direct use of multi-view information. Compared to the two-stage framework, GMT achieves significant improvements on existing datasets, with gains of up to 21.3 percent in CVMA and 17.2 percent in CVIDF1. Furthermore, we introduce VisionTrack, a high-quality, large-scale MCMT dataset providing significantly greater diversity than existing datasets. Our code and dataset will be released.
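
The global association idea in the abstract can be illustrated with a toy sketch. This is not the authors' GTA module, which is a learned component; it is a hand-written stand-in that assumes each detection arrives with an already view-aligned feature vector (the role CFCE plays in the paper). The names `GlobalTrajectory`, `associate`, and `threshold`, the mean-pooled trajectory embedding, and the greedy cosine matching are all assumptions made for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


class GlobalTrajectory:
    """Hypothetical container: one identity, features pooled over all views."""

    def __init__(self, track_id, feature):
        self.track_id = track_id
        self.features = [list(feature)]

    def embedding(self):
        # Mean of all stored features (an illustrative choice, not the paper's encoder).
        n = len(self.features)
        return [sum(col) / n for col in zip(*self.features)]


def associate(trajectories, detections, threshold=0.5):
    """Assign detections from every view to global trajectories in one pass.

    detections: list of (view_id, feature). Returns one track id per detection.
    Unmatched detections each start a new trajectory (a simplification: two
    views of the same newborn target would get two ids here).
    """
    snapshots = {t.track_id: t.embedding() for t in trajectories}
    pairs = []
    for d_idx, (_, feat) in enumerate(detections):
        for tid, emb in snapshots.items():
            score = cosine(feat, emb)
            if score >= threshold:
                pairs.append((score, d_idx, tid))
    pairs.sort(reverse=True)  # greedy: best-scoring pairs first

    assigned = {}
    used = set()  # a trajectory takes at most one detection per view
    for _, d_idx, tid in pairs:
        view = detections[d_idx][0]
        if d_idx in assigned or (tid, view) in used:
            continue
        assigned[d_idx] = tid
        used.add((tid, view))

    by_id = {t.track_id: t for t in trajectories}
    next_id = max(by_id, default=-1) + 1
    result = []
    for d_idx, (_, feat) in enumerate(detections):
        if d_idx in assigned:
            tid = assigned[d_idx]
            by_id[tid].features.append(list(feat))
        else:
            tid = next_id
            next_id += 1
            new_traj = GlobalTrajectory(tid, feat)
            trajectories.append(new_traj)
            by_id[tid] = new_traj
        result.append(tid)
    return result


trajs = [GlobalTrajectory(0, [1.0, 0.0]), GlobalTrajectory(1, [0.0, 1.0])]
dets = [("cam_a", [0.9, 0.1]), ("cam_b", [0.8, 0.2]),
        ("cam_a", [0.1, 0.9]), ("cam_b", [-1.0, -1.0])]
ids = associate(trajs, dets)
```

In this sketch the detections from `cam_a` and `cam_b` that resemble trajectory 0 receive the same identity in a single step, with no separate inter-camera matching stage, which is the contrast with the two-stage paradigm that the abstract draws.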

Paper Structure (20 sections, 11 equations, 6 figures, 5 tables)

This paper contains 20 sections, 11 equations, 6 figures, 5 tables.

Introduction
Related Work
One-stage MCMT Trackers
Two-stage MCMT Trackers
VisionTrack Dataset
Contribution of VisionTrack
Data Collection and Annotation
Method
Overview
Cross-View Feature Consistency Enhancement
Global-Level Trajectory Association
Training Procedure
Experiments
Implementation Details
Evaluation Metrics
...and 5 more sections

Figures (6)

Figure 1: Two-stage vs Global framework. Compared with (a) the two-stage paradigm, (b) the proposed global framework directly exploits information from all views for tracking by encoding all historical targets into global trajectories.
Figure 2: Quantitative analysis of the VisionTrack dataset. (a) The red line represents the average size of each target in different scenes, and the bar chart represents the average number of targets per frame. (b) The proportion of different scenes and weather conditions in VisionTrack.
Figure 3: Overview of the Global MCMT tracking framework (GMT). GMT is composed of the CFCE module, which enhances intra-trajectory consistency, and the GTA module, which enables trajectory encoding, target encoding, and target-trajectory interaction. By encoding the same targets from different views into global trajectories and performing global-level trajectory-target interaction, GMT can directly utilize multi-view information for tracking while avoiding unnecessary cross-view matching.
Figure 4: The structure of the GTA module. The GTA module implements the global-level trajectory-target association after encoding the global trajectory and the current target features.
Figure 5: Qualitative comparison of single-view tracking (EPFL 01) and cross-view matching (park, path) under challenging scenarios. Colors of the target bounding boxes denote distinct target IDs. ‘Fail’ indicates that the tracker fails to maintain tracking of the target and treats it as a newborn target, while ‘Wrong’ denotes an incorrect association with a different target.
...and 1 more figure
