WSCF-MVCC: Weakly-supervised Calibration-free Multi-view Crowd Counting
Bin Li, Daijie Chen, Qi Zhang
TL;DR
The paper tackles multi-view crowd counting without camera calibration or dense annotations by proposing WSCF-MVCC, a three-module framework that uses image-level counts to supervise a single-view counter, learns view correspondences via a homography-based matching estimator, and fuses per-view density maps through learned weights. A self-supervised ranking loss with multi-scale priors strengthens local region predictions, while semantic information guides view matching for better fusion. Experiments on CVCS, CityStreet, and PETS2009 show that WSCF-MVCC outperforms calibration-free baselines and rivals calibrated methods, highlighting practical viability. The work also provides extensive ablations and visualizations that justify the design choices and demonstrate robustness across camera configurations, and it releases code for replication.
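The fusion step mentioned above can be illustrated with a minimal sketch. It assumes each view comes with a per-pixel weight map in [0, 1] that down-weights regions also visible from other cameras, so that people seen twice are not counted twice. The function name `fuse_view_counts` and the toy values are hypothetical and are not taken from the released code.

```python
def fuse_view_counts(density_maps, weight_maps):
    """Scene-level count as a weighted sum over per-view density maps.

    density_maps: list of 2D lists, one predicted density map per camera view.
    weight_maps:  list of 2D lists with matching shapes, values in [0, 1]
                  (assumed to come from a learned view-matching module).
    """
    total = 0.0
    for density, weights in zip(density_maps, weight_maps):
        for d_row, w_row in zip(density, weights):
            for d, w in zip(d_row, w_row):
                total += w * d
    return total


# Toy example: the right column of view A and the left column of view B
# are assumed to observe the same ground region, so each gets weight 0.5.
view_a = [[1.0, 1.0], [0.0, 2.0]]
view_b = [[1.0, 0.0], [2.0, 1.0]]
w_a = [[1.0, 0.5], [1.0, 0.5]]
w_b = [[0.5, 1.0], [0.5, 1.0]]

scene_count = fuse_view_counts([view_a, view_b], [w_a, w_b])
naive_count = fuse_view_counts(
    [view_a, view_b], [[[1.0, 1.0], [1.0, 1.0]]] * 2
)  # unweighted sum double-counts the shared region
```

In this toy setup the unweighted sum counts the shared region twice, while the weighted sum counts it once.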
Abstract
Multi-view crowd counting can effectively mitigate the occlusion issues that commonly arise in single-image crowd counting. Existing deep-learning multi-view crowd counting methods project images from different camera views onto a common space to obtain ground-plane density maps, which requires abundant and costly crowd annotations and camera calibrations. Hence, calibration-free methods have been proposed that require neither camera calibration nor scene-level crowd annotations. However, existing calibration-free methods still require expensive image-level crowd annotations for training the single-view counting module. Thus, in this paper, we propose a weakly-supervised calibration-free multi-view crowd counting method (WSCF-MVCC), which directly uses the crowd count as supervision for the single-view counting module rather than density maps constructed from crowd annotations. In addition, a self-supervised ranking loss that leverages multi-scale priors is used to enhance the model's perceptual ability without additional annotation costs. Furthermore, the proposed model leverages semantic information to achieve more accurate view matching and, consequently, a more precise scene-level crowd count estimation. The proposed method outperforms state-of-the-art methods on three widely used multi-view counting datasets under weakly supervised settings, indicating that it is more suitable for practical deployment than calibrated methods. Code is available at https://github.com/zqyq/Weakly-MVCC.
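The two training signals named in the abstract can be sketched as follows. This is a minimal illustration rather than the paper's formulation, which the abstract does not spell out. It assumes the count loss is an absolute error on the summed density map, and that the multi-scale prior is the usual containment constraint: a crop nested inside a larger crop cannot contain more people. The names `count_loss` and `ranking_loss` and the `margin` parameter are hypothetical.

```python
def count_loss(density_map, gt_count):
    """Absolute error between the summed density map and the image-level count."""
    predicted = sum(sum(row) for row in density_map)
    return abs(predicted - gt_count)


def ranking_loss(nested_counts, margin=0.0):
    """Hinge penalty on counts predicted for nested crops of one image.

    nested_counts: predicted counts ordered from the largest crop to the
    smallest, where each crop lies inside the previous one. Each count is
    assumed to come from running the counter on that crop separately.
    A penalty is added whenever an inner crop is predicted to hold more
    people than the crop that contains it. No labels are needed.
    """
    loss = 0.0
    for outer, inner in zip(nested_counts, nested_counts[1:]):
        loss += max(0.0, inner - outer + margin)
    return loss


density = [[0.5, 1.5], [1.0, 2.0]]            # sums to 5.0
l_count = count_loss(density, gt_count=6)     # weak supervision: only the count is known
l_rank_ok = ranking_loss([5.0, 3.0, 1.0])     # ordering respected
l_rank_bad = ranking_loss([5.0, 6.0, 1.0])    # inner crop exceeds its parent
```

The ranking term only constrains the ordering of predictions, so it can be computed on any image without annotation, which is what makes it a self-supervised complement to the count loss.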
