
DynamicTrack: Advancing Gigapixel Tracking in Crowded Scenes

Yunqi Zhao, Yuchen Guo, Zheng Cao, Kai Ni, Ruqi Huang, Lu Fang

TL;DR

DynamicTrack tackles gigapixel crowded-scene tracking under severe occlusion by introducing a head–body joint detector trained with contrastive (Associative Embedding) learning and a cascade-based dynamic association that fuses head and body cues. The approach integrates an embedding branch into a Faster-RCNN backbone and optimizes head–body correspondences with $L_{pull}$ and $L_{push}$ losses, achieving robust matching through cascade association. Empirically, DynamicTrack delivers state-of-the-art results on MOT20 and PANDA among two-stage trackers, with ablations confirming substantial gains from head cues and head–body joint optimization. The work advances practical gigapixel tracking in crowded scenes, with implications for surveillance and pedestrian analysis, and points to future integration with transformer-based detectors for further improvements.
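The head-body $L_{pull}$ / $L_{push}$ objective can be illustrated with a minimal NumPy sketch. It assumes the CornerNet-style form of Associative Embedding: one scalar embedding per box, the head and body of each pedestrian pulled toward their mean, and the means of different pedestrians pushed at least `margin` apart. The function name `associative_embedding_losses`, the scalar embeddings, the row-aligned input layout, and `margin=1.0` are illustrative assumptions, not the paper's stated formulation.

```python
import numpy as np


def associative_embedding_losses(head_emb, body_emb, margin=1.0):
    """Pull/push losses for N matched head-body pairs (hypothetical helper).

    head_emb, body_emb: sequences of length N holding one scalar embedding
    per box; entry i of both belongs to the same pedestrian (assumed layout).
    Returns (l_pull, l_push) as floats.
    """
    head_emb = np.asarray(head_emb, dtype=float)
    body_emb = np.asarray(body_emb, dtype=float)
    n = head_emb.shape[0]
    if n == 0:
        return 0.0, 0.0
    # Reference embedding per identity: the mean of its head and body embeddings.
    center = (head_emb + body_emb) / 2.0
    # Pull: draw the head and body of the same person toward their shared center.
    l_pull = np.mean((head_emb - center) ** 2 + (body_emb - center) ** 2)
    if n < 2:
        return float(l_pull), 0.0
    # Push: hinge penalty whenever two different people's centers are closer than margin.
    diff = np.abs(center[:, None] - center[None, :])
    hinge = np.maximum(0.0, margin - diff)
    np.fill_diagonal(hinge, 0.0)  # a person is never pushed away from itself
    l_push = hinge.sum() / (n * (n - 1))
    return float(l_pull), float(l_push)
```

For two pedestrians whose embeddings are already well separated (for example heads `[0.0, 2.0]` and bodies `[0.2, 2.2]`), only the small pull term remains and the push term is zero; the hinge keeps the push loss from growing unboundedly once identities are far enough apart.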

Abstract

Tracking in gigapixel scenarios holds numerous potential applications in video surveillance and pedestrian analysis. Existing algorithms attempt to perform tracking in crowded scenes by utilizing multiple cameras or group relationships. However, their performance significantly degrades when confronted with complex interaction and occlusion inherent in gigapixel images. In this paper, we introduce DynamicTrack, a dynamic tracking framework designed to address gigapixel tracking challenges in crowded scenes. In particular, we propose a dynamic detector that utilizes contrastive learning to jointly detect the head and body of pedestrians. Building upon this, we design a dynamic association algorithm that effectively utilizes head and body information for matching purposes. Extensive experiments show that our tracker achieves state-of-the-art performance on widely used tracking benchmarks specifically designed for gigapixel crowded scenes.
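The dynamic association step can be pictured as a cascade in which each stage relies on a different cue. The sketch below is a hedged illustration, not the paper's algorithm: it assumes IoU as the only matching cost, a body-then-head stage order, the thresholds `body_thresh=0.5` and `head_thresh=0.3`, and greedy matching in place of the Hungarian solver commonly used by two-stage trackers. All function names are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def greedy_match(rows, cols, score, thresh):
    """One-to-one pairing by descending score, keeping pairs with score >= thresh."""
    cands = sorted(((score(r, c), r, c) for r in rows for c in cols), reverse=True)
    used_r, used_c, pairs = set(), set(), []
    for s, r, c in cands:
        if s < thresh:
            break  # all remaining candidates score below the threshold
        if r in used_r or c in used_c:
            continue
        used_r.add(r)
        used_c.add(c)
        pairs.append((r, c))
    return pairs


def cascade_associate(tracks, dets, body_thresh=0.5, head_thresh=0.3):
    """Two-stage cascade: body IoU first, then head IoU for what is left.

    tracks, dets: lists of dicts with a 'body' box and an optional 'head' box.
    Returns a list of (track_index, detection_index) pairs.
    """
    def body_score(t, d):
        return iou(tracks[t]["body"], dets[d]["body"])

    def head_score(t, d):
        return iou(tracks[t]["head"], dets[d]["head"])

    # Stage 1: body cues, reliable for pedestrians that are not occluded.
    matches = greedy_match(range(len(tracks)), range(len(dets)), body_score, body_thresh)
    used_t = {t for t, _ in matches}
    used_d = {d for _, d in matches}
    # Stage 2: head cues for the leftovers, e.g. pedestrians with occluded bodies.
    rest_t = [t for t in range(len(tracks)) if t not in used_t and tracks[t].get("head")]
    rest_d = [d for d in range(len(dets)) if d not in used_d and dets[d].get("head")]
    matches += greedy_match(rest_t, rest_d, head_score, head_thresh)
    return matches
```

In a toy case where a detection's body is cut off by an occluder (body IoU 0.4 with its track, below `body_thresh`) but its head is fully visible, body-only matching leaves the track unmatched, while the head stage of the cascade recovers it.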

Paper Structure

This paper contains 10 sections, 7 equations, 5 figures, 5 tables, 1 algorithm.

Figures (5)

  • Figure 1: Comparison of body tracking and head-body tracking. (a) Body tracking suffers ID switches and fragmented trajectories in interactive, occluded scenarios. (b) Head-body tracking remains robust in crowded scenes.
  • Figure 2: Overview of the DynamicTrack framework for gigapixel tracking. Dynamic Detection: a contrastive learning-based detector simultaneously detects the body and the head of each pedestrian. Dynamic Association: the head and body of the same identity are used dynamically for matching, enabling robust tracking in crowded scenes.
  • Figure 3: Framework of our dynamic detector for head-body detection, a modified version of the classical two-stage detector Faster-RCNN [ren2015faster]. We add a branch for embedding learning and supervise it with an associative embedding loss based on contrastive learning.
  • Figure 4: Visualization results of DynamicTrack. We select gigapixel sequences from the PANDA test set to demonstrate how DynamicTrack handles complex crowded scenarios. Customizable visualization windows are shown as green and blue rectangles, and colors indicate identities: bounding boxes of the same color belong to the same identity.
  • Figure 5: Visualization of detection results and matching head and body pairs on CrowdHuman test set.
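Figures 3 and 5 imply that, at inference, each detected head must be grouped with its body. The following is a hedged sketch of embedding-based grouping, not the paper's procedure: it assumes scalar embeddings, a gate requiring the head center to lie inside the body box, greedy one-to-one assignment, and a hypothetical cutoff `max_dist=0.5`. In a crowd a head center often falls inside several overlapping body boxes, so geometry alone is ambiguous and the learned embedding breaks the tie.

```python
def pair_heads_to_bodies(heads, bodies, max_dist=0.5):
    """Group detected heads with bodies by embedding distance (hypothetical helper).

    heads, bodies: lists of (box, embedding), box = (x1, y1, x2, y2) and
    embedding a float. Returns sorted (head_index, body_index) pairs.
    """
    cands = []
    for h, (hbox, h_emb) in enumerate(heads):
        cx, cy = (hbox[0] + hbox[2]) / 2, (hbox[1] + hbox[3]) / 2
        for b, (bbox, b_emb) in enumerate(bodies):
            # Geometric gate: the head center must lie inside the body box.
            inside = bbox[0] <= cx <= bbox[2] and bbox[1] <= cy <= bbox[3]
            dist = abs(h_emb - b_emb)
            if inside and dist <= max_dist:
                cands.append((dist, h, b))
    # Greedy one-to-one assignment, closest embeddings first.
    cands.sort()
    used_h, used_b, pairs = set(), set(), []
    for _, h, b in cands:
        if h in used_h or b in used_b:
            continue
        used_h.add(h)
        used_b.add(b)
        pairs.append((h, b))
    return sorted(pairs)
```

With two overlapping bodies whose boxes both contain both head centers, the embedding distance alone decides which head belongs to which body; tightening `max_dist` trades missed pairs for fewer wrong ones.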