Multi-View People Detection in Large Scenes via Supervised View-Wise Contribution Weighting
Qi Zhang, Yunfei Gong, Daijie Chen, Antoni B. Chan, Hui Huang
TL;DR
This work tackles multi-view people detection (MVD) in large, heavily occluded scenes, where prior methods trained on small, fixed scenes fail to generalize. It introduces a four-stage supervised view-wise contribution weighting framework: per-view features are projected to a common ground plane; each projected view is decoded under a view-level loss $\ell_v$; a shared subnet learns weights $W_i$ that fuse the views into a single feature $F$; and $F$ is decoded into the final scene prediction. The total training loss is $\ell = \ell_s + \lambda \ell_v$, and a domain-adaptation discriminator is used to improve cross-scene transfer. The model is trained on the large synthetic CVCS dataset and evaluated on CVCS and CityStreet, with additional finetuning and domain adaptation for cross-scene generalization. Results show better cross-scene performance and more robust fusion than state-of-the-art methods. The study establishes the first large-scene MVD evaluation, demonstrates the effectiveness of supervised view-wise fusion, and broadens practical applicability to surveillance and city-scale monitoring through domain adaptation and synthetic data augmentation.
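The weighted fusion and the combined loss described above can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes that each view contributes one scalar feature per ground-plane cell, that the weights $W_i$ come from a softmax over per-view scores at each cell, and that both $\ell_s$ and $\ell_v$ are mean squared errors. The function names (`fuse_views`, `mse`, `total_loss`) and the default value of `lam` are hypothetical.

```python
import math


def fuse_views(features, scores):
    """Fuse per-view ground-plane features into one feature map F.

    features[i][c]: feature of view i at ground-plane cell c (a scalar here).
    scores[i][c]:   unnormalized contribution score of view i at cell c.
    Assumption: W_i is a softmax over views, computed independently per cell.
    """
    n_views = len(features)
    n_cells = len(features[0])
    fused = []
    for c in range(n_cells):
        # Subtract the maximum score for numerical stability.
        top = max(scores[i][c] for i in range(n_views))
        exps = [math.exp(scores[i][c] - top) for i in range(n_views)]
        total = sum(exps)
        weights = [e / total for e in exps]  # W_i at this cell, sums to 1
        fused.append(sum(weights[i] * features[i][c] for i in range(n_views)))
    return fused


def mse(pred, target):
    """Mean squared error between two equally sized lists."""
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def total_loss(scene_pred, scene_gt, view_preds, view_gts, lam=1.0):
    """Combined loss l = l_s + lam * l_v.

    Assumption: l_v is the sum of the per-view MSE terms.
    """
    l_s = mse(scene_pred, scene_gt)
    l_v = sum(mse(p, g) for p, g in zip(view_preds, view_gts))
    return l_s + lam * l_v


if __name__ == "__main__":
    # Two views, two ground-plane cells, equal scores: fusion is a plain average.
    print(fuse_views([[1.0, 2.0], [3.0, 4.0]], [[0.0, 0.0], [0.0, 0.0]]))
```

With equal scores every view gets the same weight, so the fused feature is the average of the views. Raising the score of one view at a cell shifts the fused value toward that view's feature, which is the behaviour one would want where another camera is occluded or far away.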
Abstract
Recent deep learning-based multi-view people detection (MVD) methods have shown promising results on existing datasets. However, current methods are mainly trained and evaluated on small, single scenes with a limited number of multi-view frames and fixed camera views. As a result, these methods may not be practical for detecting people in larger, more complex scenes with severe occlusions and camera calibration errors. This paper focuses on improving multi-view people detection by developing a supervised view-wise contribution weighting approach that better fuses multi-camera information in large scenes. In addition, a large synthetic dataset is adopted to enhance the model's generalization ability and to enable more practical evaluation and comparison. The model's performance on new testing scenes is further improved with a simple domain adaptation technique. Experimental results demonstrate the effectiveness of our approach in achieving promising cross-scene multi-view people detection performance. Code is available at https://vcc.tech/research/2024/MVD.
