One-Shot Crowd Counting With Density Guidance For Scene Adaptation

Jiwei Chen, Qi Wang, Junyu Gao, Jing Zhang, Dingyi Li, Jing-Jia Luo

TL;DR

This work tackles the generalization gap in crowd counting across unseen surveillance scenes by treating each scene as a category and leveraging a one-shot exemplar. It introduces LGD-OSCC, a dual-density framework that combines local density guidance, via three EM-derived prototypes capturing high, medium, and low densities, with global density guidance implemented through a transformer that adapts the query representations. Key contributions include the Multiple Local Density Learner for density prototype extraction, a local-to-global guidance mechanism, and an end-to-end learning strategy that alternates between base-model training and EM optimization on the support image. Empirical results on WorldExpo'10, Venice, and CityUHK-X show strong generalization and better performance than state-of-the-art few-shot crowd counting methods, which makes the approach relevant for cross-scene surveillance analysis.

Abstract

Crowd scenes captured by cameras at different locations vary greatly, and existing crowd counting models generalize poorly to unseen surveillance scenes. To improve generalization, we regard different surveillance scenes as different scene categories and introduce few-shot learning so that the model adapts to an unseen surveillance scene belonging to the scene category of the given exemplar. To this end, we propose to leverage local and global density characteristics to guide crowd counting in unseen surveillance scenes. Specifically, to enable the model to adapt to the varying densities in the target scene, we propose the multiple local density learner, which learns multiple prototypes representing different density distributions in the support scene. These prototypes are then used to encode multiple local density similarity matrices, which guide the model in a local way. To further adapt to the global density of the target scene, global density features are extracted from the support image and used to guide the model in a global way. Experiments on three surveillance datasets show that the proposed method adapts to unseen surveillance scenes and outperforms recent state-of-the-art methods in few-shot crowd counting.
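The prototype-learning step described in the abstract can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names (`em_prototypes`, `similarity_maps`), the temperature `kappa`, the iteration count, and the choice to weight support features by the ground-truth density are all assumptions made for illustration. The sketch runs a soft-assignment EM loop over support features to obtain three prototypes, then computes one cosine-similarity map per prototype for the query features.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-8):
    """Scale vectors along `axis` to (approximately) unit length."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def em_prototypes(feats, weights, k=3, iters=10, kappa=20.0, seed=0):
    """Hypothetical EM-style prototype extraction (illustrative only).

    feats:   (N, C) support features, one row per spatial location.
    weights: (N,) non-negative density values used to weight each location
             (an assumption of this sketch).
    Returns: (k, C) unit-norm prototypes.
    """
    rng = np.random.default_rng(seed)
    feats = l2_normalize(feats)
    # Initialise prototypes from k distinct support locations.
    protos = feats[rng.choice(len(feats), size=k, replace=False)]
    for _ in range(iters):
        # E-step: softmax over scaled cosine similarities -> responsibilities.
        logits = kappa * feats @ protos.T            # (N, k)
        logits -= logits.max(axis=1, keepdims=True)  # numerical stability
        resp = np.exp(logits)
        resp /= resp.sum(axis=1, keepdims=True)
        # M-step: density-weighted mean of features, re-normalised.
        w = resp * weights[:, None]                  # (N, k)
        protos = l2_normalize(w.T @ feats)           # (k, C)
    return protos

def similarity_maps(query_feats, protos):
    """Cosine similarity between each query location and each prototype.

    query_feats: (H, W, C); protos: (k, C). Returns: (k, H, W).
    """
    q = l2_normalize(query_feats)
    return np.einsum("hwc,kc->khw", q, protos)

# Toy usage with random tensors standing in for backbone features.
rng = np.random.default_rng(0)
support = rng.normal(size=(8, 8, 16))
density = rng.random(size=(8, 8))
query = rng.normal(size=(8, 8, 16))

protos = em_prototypes(support.reshape(-1, 16), density.reshape(-1))
ldsm = similarity_maps(query, protos)
print(protos.shape, ldsm.shape)  # (3, 16) (3, 8, 8)
```

In this sketch the three similarity maps play the role of the local density similarity matrices; how the paper encodes them and injects them into the query branch is not specified here and is left out.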

Paper Structure (15 sections, 9 equations, 4 figures, 5 tables, 1 algorithm)


Figures (4)

  • Figure 1: The pipeline of the proposed model. The support image and the query image are collected from the same category of surveillance scene. The multiple local density learner uses the support and query features to encode the local density similarity matrices, which guide the model in a local way. The global density feature guides the model in a global way. Query ET-DM denotes the estimated density map of the query image.
  • Figure 2: The architecture of the proposed LGD-OSCC (best viewed in color). It employs a dual-branch architecture guided by density features. In the support branch, the ground-truth density map is mapped to the support features. On the one hand, the mapping results are used to encode global density features. On the other hand, they are optimized into multiple density prototypes by the EM algorithm in the multiple local density learner, and these prototypes are used to encode the local density similarity matrices. In the query branch, the local density similarity matrices guide the model in a local way. Subsequently, the transformer uses the global density features for global guidance. LDSM denotes the local density similarity matrices. GT-DM denotes the ground-truth density map. ET-DM denotes the estimated density map. $\otimes$ denotes element-wise multiplication. $\oplus$ denotes the element-wise sum.
  • Figure 3: Visualization results on the Venice dataset. First column: test images from three unseen surveillance scenes. Second column: the ground-truth density maps. Third column: the query density maps predicted by the proposed LGD-OSCC. GT denotes the ground-truth count. ET denotes the estimated count.
  • Figure 4: Visualization results on the CityUHK-X dataset. First column: test images from three unseen surveillance scenes. Second column: the ground-truth density maps. Third column: the query density maps predicted by the proposed LGD-OSCC. GT denotes the ground-truth count. ET denotes the estimated count.
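The GT and ET counts shown alongside the density maps in Figures 3 and 4 can be related to the maps themselves with a short sketch. It assumes the usual convention in density-map crowd counting that the count is the sum of the map over all pixels; the helper name `count_from_density` and the toy arrays are hypothetical and not taken from the paper.

```python
import numpy as np

def count_from_density(density_map):
    """Crowd count as the sum of a density map (assumed convention)."""
    return float(np.sum(density_map))

# Toy ground-truth map: two annotated heads, one unit of mass each.
gt_dm = np.zeros((4, 4))
gt_dm[1, 1] = 1.0
gt_dm[2, 3] = 1.0

# Toy estimated map: uniform density whose total mass is 3.
et_dm = np.full((4, 4), 0.1875)

gt_count = count_from_density(gt_dm)
et_count = count_from_density(et_dm)
abs_error = abs(et_count - gt_count)
print(gt_count, et_count, abs_error)  # 2.0 3.0 1.0
```

Averaging this absolute error over a test set gives the mean absolute error commonly reported for crowd counting.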