DenseTrack: Drone-based Crowd Tracking via Density-aware Motion-appearance Synergy

Yi Lei, Huilin Zhu, Jingling Yuan, Guangli Xiang, Xian Zhong, Shengfeng He

TL;DR

DenseTrack addresses drone-based crowd tracking challenges by unifying density-map localization with motion offsets and appearance cues. It extracts appearance from density-map crops via BLIP2, uses a motion-position map for inter-frame offsets, and applies diffusion-based appearance retrieval, with Hungarian matching to fuse cues across frames. The approach achieves state-of-the-art results on DroneCrowd (notably a T-mAP of 39.44) and is validated through extensive ablations demonstrating the importance of appearance, diffusion retrieval, and density-aware localization. This density-centric, cross-modal integration offers robust tracking in crowded drone imagery and provides a practical framework for real-world aerial surveillance and crowd monitoring.

Abstract

Drone-based crowd tracking struggles to identify and monitor objects accurately from an aerial perspective, largely because the objects are small and closely packed, which complicates both localization and tracking. To address these challenges, we present the Density-aware Tracking (DenseTrack) framework. DenseTrack capitalizes on crowd counting to precisely determine object locations, blending visual and motion cues to improve the tracking of small-scale objects. It specifically addresses the problem of cross-frame motion to enhance tracking accuracy and dependability. DenseTrack employs crowd density estimates as anchors for exact object localization within video frames. These estimates are merged with motion and position information from the tracking network, with motion offsets serving as key tracking cues. Moreover, DenseTrack enhances the ability to distinguish small-scale objects using insights from a visual-language model, integrating appearance with motion cues. The framework uses the Hungarian algorithm to match individuals across frames. Demonstrated on the DroneCrowd dataset, our approach exhibits superior performance, confirming its effectiveness in scenarios captured by drones.
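
To make the idea of density estimates acting as localization anchors more concrete, the following is a minimal sketch of turning a density map into point locations by taking thresholded local maxima. It is not the authors' implementation: the function name `localize_peaks`, the 3x3 neighborhood, and the `threshold` value are all assumptions chosen for illustration, and the toy density map is made up.

```python
def localize_peaks(density, threshold=0.5):
    """Return (x, y) positions of strict local maxima above `threshold`.

    `density` is a 2D list of floats (a toy stand-in for a predicted
    density map). The 3x3 neighborhood and the threshold are illustrative
    assumptions, not values taken from the paper.
    """
    height, width = len(density), len(density[0])
    peaks = []
    for y in range(height):
        for x in range(width):
            value = density[y][x]
            if value < threshold:
                continue
            # Collect the up-to-8 neighbors, clipping at the map border.
            neighbors = [
                density[ny][nx]
                for ny in range(max(0, y - 1), min(height, y + 2))
                for nx in range(max(0, x - 1), min(width, x + 2))
                if (ny, nx) != (y, x)
            ]
            # Strict comparison: flat plateaus are skipped in this sketch.
            if all(value > n for n in neighbors):
                peaks.append((x, y))
    return peaks


# Toy 5x5 density map with two blobs (hypothetical data).
toy_density = [
    [0.0, 0.1, 0.0, 0.0, 0.0],
    [0.1, 0.9, 0.1, 0.0, 0.0],
    [0.0, 0.1, 0.0, 0.2, 0.0],
    [0.0, 0.0, 0.2, 0.8, 0.2],
    [0.0, 0.0, 0.0, 0.2, 0.0],
]
points = localize_peaks(toy_density)
print(points)
```

Each returned point can then serve as an anchor at which motion offsets and appearance features are read out. A real pipeline would more likely use an array library with a maximum filter instead of nested Python loops.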

Paper Structure

This paper contains 28 sections, 17 equations, 8 figures, 7 tables, 1 algorithm.

Figures (8)

  • Figure 1: Illustration of localization and tracking techniques. The upper section contrasts (a) detection-based localization, which relies on identifying objects directly, with (b) counting-based localization, which estimates object positions through density analysis. The lower section highlights inaccuracies in (c) Tracking by Motion, where predictions are based on movement patterns, and (d) Tracking by Appearance, which uses visual features; identically colored points indicate predictions for the same individual.
  • Figure 2: DenseTrack is structured around three essential components: Localization, Individual Representation, and Association. Localization accurately determines the spatial positions of individuals in crowds through density maps. For Individual Representation, motion and appearance features are extracted by aligning density maps with motion and position maps (MPM) to provide motion cues, while the BLIP2 method is used to gather appearance cues. The Association component employs diffusion-based retrieval alongside a distance matrix derived from motion cues to facilitate precise inter-frame individual matching.
  • Figure 3: Illustration of tracking under different conditions. (a) Sparse small objects in cloudy weather conditions. (b) Dense small objects in sunny weather conditions, with the same color representing the same individual.
  • Figure 4: Illustration of tracking performance using different strategies across frames 10, 13, and 16: (a) original aerial image, (b) ground-truth annotations, (c) tracking based solely on appearance, (d) tracking based solely on motion, and (e) tracking integrating appearance and motion. Insets magnify tracking results, showcasing the performance of each strategy.
  • Figure 5: Comparison of different tracking methods across frames 1, 4, and 7: (a) original surveillance footage, (b) ground-truth annotations, (c) tracking results from STNNet, (d) tracking results from MPM, and (e) our DenseTrack results. False negatives are marked with white dotted circles, and tracking switch errors with white rectangles. Insets provide a detailed view of tracking discrepancies, using consistent color coding to identify each individual.
  • ...and 3 more figures
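
The Association step described in the abstract and in the Figure 2 caption (a motion-derived distance matrix combined with appearance cues, resolved with the Hungarian algorithm) can be sketched as follows. This is a hedged illustration only: the weighted-sum fusion, the names `fused_cost` and `hungarian`, and the parameters `weight` and `scale` are assumptions, and plain cosine distance stands in for the diffusion-based retrieval used in the paper. The solver is a compact pure-Python Kuhn-Munkres variant so the sketch has no dependencies; a library routine such as `scipy.optimize.linear_sum_assignment` would be the usual choice in practice.

```python
import math


def fused_cost(pred_pts, curr_pts, pred_feats, curr_feats, weight=0.5, scale=100.0):
    """Blend motion distance and appearance distance into one cost matrix.

    `weight` and `scale` are hypothetical knobs for this sketch.
    """
    def cosine_distance(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
        return 1.0 - dot / norm

    return [
        [
            weight * math.dist(p, c) / scale
            + (1.0 - weight) * cosine_distance(fp, fc)
            for c, fc in zip(curr_pts, curr_feats)
        ]
        for p, fp in zip(pred_pts, pred_feats)
    ]


def hungarian(cost):
    """Minimum-cost assignment for an n x m matrix with n <= m.

    Returns a sorted list of (row, column) pairs.
    """
    n, m = len(cost), len(cost[0])
    assert n <= m, "this sketch assumes no more rows than columns"
    inf = float("inf")
    u, v = [0.0] * (n + 1), [0.0] * (m + 1)  # dual potentials
    p, way = [0] * (m + 1), [0] * (m + 1)    # p[j]: row matched to column j
    for i in range(1, n + 1):
        p[0], j0 = i, 0
        minv, used = [inf] * (m + 1), [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], inf, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0 != 0:  # walk back along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    return sorted((p[j] - 1, j - 1) for j in range(1, m + 1) if p[j] != 0)


# Toy data (hypothetical): previous-frame points already shifted by their
# motion offsets, current-frame points, and 2-D appearance features.
pred_pts = [(10, 10), (50, 50)]
curr_pts = [(51, 49), (11, 10), (90, 90)]
pred_feats = [[1.0, 0.0], [0.0, 1.0]]
curr_feats = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]

matches = hungarian(fused_cost(pred_pts, curr_pts, pred_feats, curr_feats))
print(matches)
```

In this toy case each previous individual is paired with the current point that is both nearby and similar in appearance, and the third current point stays unmatched, which a tracker would typically treat as a new identity.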