MCTR: Multi Camera Tracking Transformer

Alexandru Niculescu-Mizil, Deep Patel, Iain Melvin

TL;DR

MCTR addresses the problem of robust multi-camera multi-object tracking with overlapping fields of view by proposing an end-to-end transformer-based architecture that maintains global track embeddings updated from per-view detections. The method combines a DETR-style Detection Module, a Tracking Module that fuses information across views, and an Association Module that yields differentiable cross-view assignments, trained with a joint loss that links local detections to global tracks. Key contributions include a probabilistic, differentiable cross-view association mechanism and a training protocol that enables end-to-end learning across multiple cameras, demonstrated on the MMPTrack and AI City Challenge datasets. The work shows that end-to-end multi-camera tracking is feasible and competitive with more heuristic pipelines, with practical implications for robust surveillance and multi-view scene understanding, while also highlighting areas for improvement in long-term identity maintenance and cross-view generalization.

Abstract

Multi-camera tracking plays a pivotal role in various real-world applications. While end-to-end methods have gained significant interest in single-camera tracking, multi-camera tracking remains predominantly reliant on heuristic techniques. In response to this gap, this paper introduces Multi-Camera Tracking tRansformer (MCTR), a novel end-to-end approach tailored for multi-object detection and tracking across multiple cameras with overlapping fields of view. MCTR leverages end-to-end detectors like DEtector TRansformer (DETR) to produce detections and detection embeddings independently for each camera view. The framework maintains a set of track embeddings that encapsulate global information about the tracked objects, and updates them at every frame by integrating the local information from the view-specific detection embeddings. The track embeddings are probabilistically associated with detections in every camera view and frame to generate consistent object tracks. The soft probabilistic association facilitates the design of differentiable losses that enable end-to-end training of the entire system. To validate our approach, we conduct experiments on MMPTrack and AI City Challenge, two recently introduced large-scale multi-camera multi-object tracking datasets.
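
To make the idea of a soft, differentiable association concrete, here is a minimal sketch in plain Python. It is not the paper's formulation: the dot-product scoring, the softmax over tracks, the negative log-likelihood loss, and all names (`soft_association`, `association_nll`, the toy embeddings) are assumptions chosen for illustration only. It shows how each detection in one camera view can receive a probability distribution over global tracks instead of a hard assignment, which is what allows a loss to be defined on the association itself.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def soft_association(detection_embeddings, track_embeddings):
    """Return one probability distribution over tracks per detection.

    Assumption: similarity is a plain dot product; the paper's scoring
    function may differ.
    """
    return [
        softmax([dot(det, trk) for trk in track_embeddings])
        for det in detection_embeddings
    ]


def association_nll(probabilities, target_track_ids):
    """Mean negative log-likelihood of the labelled track for each detection.

    Assumption: a generic classification-style loss, used here only to show
    that soft assignments yield a smooth training signal.
    """
    losses = [-math.log(p[t]) for p, t in zip(probabilities, target_track_ids)]
    return sum(losses) / len(losses)


# Toy example: two global tracks, two detections from a single camera view.
tracks = [[1.0, 0.0], [0.0, 1.0]]
view_detections = [[2.0, 0.1], [0.2, 1.5]]

probs = soft_association(view_detections, tracks)
loss = association_nll(probs, [0, 1])
```

In this toy setup the first detection leans toward track 0 and the second toward track 1. Because the probabilities vary smoothly with the embeddings, the loss could be backpropagated through both the detection and track embeddings in an autodiff framework.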

Paper Structure (15 sections, 4 equations, 9 figures, 2 tables)

This paper contains 15 sections, 4 equations, 9 figures, 2 tables.

Figures (9)

  • Figure 1: Example of multi-camera frame from the MMPTrack dataset with 6 camera angles.
  • Figure 2: Model Overview.
  • Figure 3: Detection Module: DETR.
  • Figure 4: Tracking Module.
  • Figure 5: Association Module.
  • ...and 4 more figures