DenseTrack: Drone-based Crowd Tracking via Density-aware Motion-appearance Synergy
Yi Lei, Huilin Zhu, Jingling Yuan, Guangli Xiang, Xian Zhong, Shengfeng He
TL;DR
DenseTrack addresses drone-based crowd tracking challenges by unifying density-map localization with motion offsets and appearance cues. It extracts appearance from density-map crops via BLIP2, uses a motion-position map for inter-frame offsets, and applies diffusion-based appearance retrieval, with Hungarian matching to fuse cues across frames. The approach achieves state-of-the-art results on DroneCrowd (notably a T-mAP of 39.44) and is validated through extensive ablations demonstrating the importance of appearance, diffusion retrieval, and density-aware localization. This density-centric, cross-modal integration offers robust tracking in crowded drone imagery and provides a practical framework for real-world aerial surveillance and crowd monitoring.
Abstract
Drone-based crowd tracking struggles to identify and follow objects accurately from an aerial perspective, largely because the objects are small and closely packed, which complicates both localization and tracking. To address these challenges, we present the Density-aware Tracking (DenseTrack) framework. DenseTrack uses crowd counting to determine object locations precisely and combines visual and motion cues to improve the tracking of small-scale objects. It specifically addresses cross-frame motion to improve tracking accuracy and reliability. DenseTrack uses crowd density estimates as anchors for exact object localization within video frames. These estimates are merged with motion and position information from the tracking network, with motion offsets serving as key tracking cues. Moreover, DenseTrack improves its ability to distinguish small-scale objects by drawing appearance cues from a vision-language model and integrating them with the motion cues. The framework uses the Hungarian algorithm to match individuals across frames. Evaluated on the DroneCrowd dataset, our approach achieves state-of-the-art performance, confirming its effectiveness in drone-captured scenes.
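The cross-frame matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function name `match_across_frames`, the weight `alpha`, the gating distance `max_dist`, and the linear blend of the two costs are all assumptions chosen for illustration. It assumes each individual in the previous frame has a density-derived position, a predicted motion offset, and an appearance feature vector, and it uses SciPy's `linear_sum_assignment` to solve the Hungarian assignment.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment


def match_across_frames(prev_pts, prev_offsets, prev_feats,
                        curr_pts, curr_feats, alpha=0.5, max_dist=20.0):
    """Match individuals in frame t-1 to individuals in frame t.

    Hypothetical helper: alpha, max_dist, and the cost blend are
    illustrative assumptions, not values taken from the paper.
    Returns a list of (prev_index, curr_index) pairs.
    """
    prev_pts = np.asarray(prev_pts, dtype=float)
    prev_offsets = np.asarray(prev_offsets, dtype=float)
    prev_feats = np.asarray(prev_feats, dtype=float)
    curr_pts = np.asarray(curr_pts, dtype=float)
    curr_feats = np.asarray(curr_feats, dtype=float)

    # Motion cue: shift each previous position by its predicted offset,
    # then measure the distance to every current position.
    predicted = prev_pts + prev_offsets
    dist = np.linalg.norm(predicted[:, None, :] - curr_pts[None, :, :], axis=2)
    motion_cost = np.clip(dist / max_dist, 0.0, 1.0)

    # Appearance cue: cosine distance between feature vectors.
    eps = 1e-12  # guards against division by zero for all-zero features
    a = prev_feats / (np.linalg.norm(prev_feats, axis=1, keepdims=True) + eps)
    b = curr_feats / (np.linalg.norm(curr_feats, axis=1, keepdims=True) + eps)
    appearance_cost = 1.0 - a @ b.T

    # Blend both cues into one cost matrix and solve the assignment.
    cost = alpha * motion_cost + (1.0 - alpha) * appearance_cost
    rows, cols = linear_sum_assignment(cost)

    # Gate: drop pairs whose predicted position is too far from the match.
    return [(int(r), int(c)) for r, c in zip(rows, cols)
            if dist[r, c] <= max_dist]


if __name__ == "__main__":
    pairs = match_across_frames(
        prev_pts=[[10, 10], [50, 50]],
        prev_offsets=[[2, 0], [0, -3]],
        prev_feats=[[1, 0], [0, 1]],
        curr_pts=[[50, 47], [12, 10]],
        curr_feats=[[0, 1], [1, 0]],
    )
    print(pairs)  # [(0, 1), (1, 0)]
```

In this toy case both cues agree, so the assignment is unambiguous. In crowded scenes the appearance term matters most when several candidates lie at similar distances from a predicted position, which is the situation the abstract describes for small, closely spaced objects.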
