One-Shot Crowd Counting With Density Guidance For Scene Adaptation
Jiwei Chen, Qi Wang, Junyu Gao, Jing Zhang, Dingyi Li, Jing-Jia Luo
TL;DR
This work tackles the generalization gap in crowd counting across unseen surveillance scenes by treating each scene as a category and leveraging a one-shot exemplar. It introduces LGD-OSCC, a dual-density framework that combines local density guidance, via three EM-derived prototypes capturing high/medium/low densities, with global density guidance implemented through a transformer to adapt query representations. Key contributions include the Multiple Local Density Learner for density prototype extraction, a local-to-global guidance mechanism, and an end-to-end learning strategy that alternates between base-model training and EM optimization on the support image. Empirical results on WorldExpo'10, Venice, and CityUHK-X demonstrate strong generalization and superior performance over state-of-the-art few-shot crowd counting methods, highlighting the practical value for cross-scene surveillance analysis.
Abstract
Crowd scenes captured by cameras at different locations vary greatly, and existing crowd counting models have limited generalization to unseen surveillance scenes. To improve generalization, we regard different surveillance scenes as different scene categories and introduce few-shot learning so that the model can adapt to an unseen surveillance scene that belongs to the category of a given exemplar. To this end, we propose to leverage local and global density characteristics to guide crowd counting in unseen surveillance scenes. Specifically, to enable the model to adapt to the density variations in the target scene, we propose the multiple local density learner, which learns multiple prototypes representing different density distributions in the support scene. The resulting local density similarity matrices are then encoded and used to guide the model in a local way. To further adapt to the global density of the target scene, global density features are extracted from the support image and used to guide the model in a global way. Experiments on three surveillance datasets show that the proposed method can adapt to unseen surveillance scenes and outperforms recent state-of-the-art methods in few-shot crowd counting.
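
The local guidance step (density prototypes learned from the support scene, then similarity matrices against the query) can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes per-location feature vectors have already been extracted by some backbone, uses a simple soft-assignment EM over cosine similarities to estimate three prototypes, and computes one cosine similarity map per prototype. The function names `em_prototypes` and `similarity_maps`, the temperature `kappa`, and the random demo features are all hypothetical choices made for illustration.

```python
import numpy as np


def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def em_prototypes(support_feats, num_prototypes=3, num_iters=10, kappa=20.0, seed=0):
    """Estimate prototypes from support features with a simple EM loop.

    support_feats: (N, C) array, one feature vector per spatial location.
    Returns a (K, C) array of unit-norm prototypes.
    NOTE: kappa (softmax temperature) and the initialization are assumptions.
    """
    rng = np.random.default_rng(seed)
    feats = l2_normalize(support_feats)
    # Initialize prototypes from randomly chosen, distinct feature vectors.
    idx = rng.choice(feats.shape[0], size=num_prototypes, replace=False)
    protos = feats[idx].copy()
    for _ in range(num_iters):
        # E-step: soft responsibilities from scaled cosine similarity.
        logits = kappa * (feats @ protos.T)            # (N, K)
        logits -= logits.max(axis=1, keepdims=True)    # numerical stability
        resp = np.exp(logits)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: responsibility-weighted mean direction, renormalized.
        protos = l2_normalize(resp.T @ feats)          # (K, C)
    return protos


def similarity_maps(query_feats, protos):
    """Cosine similarity between each query location and each prototype.

    query_feats: (H, W, C) array; protos: (K, C) array.
    Returns a (K, H, W) array with one similarity map per prototype.
    """
    q = l2_normalize(query_feats)
    return np.einsum("hwc,kc->khw", q, protos)


# Demo with random stand-in features (hypothetical shapes, not real data).
demo_rng = np.random.default_rng(0)
support = demo_rng.normal(size=(64, 16))     # 64 support locations, 16 channels
query = demo_rng.normal(size=(8, 8, 16))     # 8x8 query feature map
protos = em_prototypes(support)
maps = similarity_maps(query, protos)
```

In a full pipeline, maps like these would be encoded and fed to the counting model as local guidance; how the prototypes come to correspond to high, medium, and low density, and how the encoding is done, is not specified in the abstract and is left out of this sketch.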
