Multi-View People Detection in Large Scenes via Supervised View-Wise Contribution Weighting

Qi Zhang, Yunfei Gong, Daijie Chen, Antoni B. Chan, Hui Huang

TL;DR

This work tackles multi-view people detection in large, occluded scenes where prior methods trained on small, fixed scenes fail to generalize. It introduces a four-stage supervised view-wise contribution weighting framework that projects per-view features to a common ground plane, uses projected-view decoding with loss $\ell_v$, and fuses views with weights $W_i$ learned by a shared subnet, producing a fused feature $F$ and a final scene prediction with loss $\ell = \ell_s + \lambda \ell_v$; a domain-adaptation discriminator is employed to improve cross-scene transfer. The approach is trained on a large synthetic CVCS dataset and evaluated on CVCS and CityStreet, with additional finetuning and domain adaptation for cross-scene generalization; results show superior cross-scene performance and robust fusion compared to state-of-the-art methods. The study establishes the first large-scene MVD evaluation, demonstrates the effectiveness of supervised view-wise fusion, and broadens practical applicability for surveillance and city-scale monitoring through domain adaptation and synthetic data augmentation.

Abstract

Recent deep learning-based multi-view people detection (MVD) methods have shown promising results on existing datasets. However, current methods are mainly trained and evaluated on small, single scenes with a limited number of multi-view frames and fixed camera views. As a result, these methods may not be practical for detecting people in larger, more complex scenes with severe occlusions and camera calibration errors. This paper focuses on improving multi-view people detection by developing a supervised view-wise contribution weighting approach that better fuses multi-camera information in large scenes. In addition, a large synthetic dataset is adopted to enhance the model's generalization ability and enable more practical evaluation and comparison. The model's performance on new testing scenes is further improved with a simple domain adaptation technique. Experimental results demonstrate the effectiveness of our approach in achieving promising cross-scene multi-view people detection performance. See code here: https://vcc.tech/research/2024/MVD.

Paper Structure (18 sections, 2 equations, 5 figures, 6 tables)

Figures (5)

  • Figure 1: The scene area comparison of CVCS, CityStreet, Wildtrack, and MultiviewX. The scenes of the latter two datasets are much smaller than those of the first two.
  • Figure 2: The pipeline of the proposed view-wise contribution weighting method, which consists of 4 stages: single-view feature extraction and projection, projected single-view decoding, supervised view-wise contribution weighted fusion, and multi-view feature decoding. First, camera view features are extracted by the shared feature extraction net and projected to the ground plane. Second, each view's projected feature $F_i$ is fed into a decoder to predict the view's people location map $V_i$ on the ground; the loss is $\ell_v$, whose ground-truth is obtained from the scene ground-truth $V^{gt}_s$. Third, each view's people location map prediction $V_i$ is fed into a subnet ${\cal C}$ and then weighted across all camera views to obtain weight maps $W_i$ for multi-view fusion. The predicted weight maps $W_i$ are used to fuse the multi-view features $F_i$ by weighted summation. Finally, the fused multi-view feature $F$ is decoded to predict the whole scene's people location map, and the loss is $\ell_s$.
  • Figure 3: 'View GT' is the ground-truth for each view in projected single-view decoding, which is the people occupancy map on the ground that can be seen by the corresponding view, and 'Scene GT' stands for the ground-truth for the whole scene of CityStreet. The lines in the 'View GT' indicate the field-of-view region of the camera view.
  • Figure 4: The domain adaptation approach used in our method for generalizing to novel scenes.
  • Figure 5: Visualization of the method's results: camera view input, single-view prediction, view weight map, and the corresponding ground-truth and prediction results.
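
The weighted fusion step described in the Figure 2 caption, together with the combined loss $\ell = \ell_s + \lambda \ell_v$ from the TL;DR, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names, the tensor shapes, and in particular the use of a softmax to normalize the weights across views are assumptions made for illustration, and the output of the subnet ${\cal C}$ is replaced by random score maps.

```python
import numpy as np


def view_weights(scores):
    """Turn per-view score maps of shape (N, H, W) into weight maps W_i.

    Assumption: the weights are normalized across the view axis with a
    softmax, so that the N weights at each ground-plane location sum to 1.
    """
    shifted = scores - scores.max(axis=0, keepdims=True)  # numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum(axis=0, keepdims=True)


def fuse_views(features, weights):
    """Fuse projected features F_i by weighted summation.

    features: (N, channels, H, W) projected per-view features F_i
    weights:  (N, H, W) weight maps W_i
    returns:  (channels, H, W) fused feature F = sum_i W_i * F_i
    """
    return (weights[:, None, :, :] * features).sum(axis=0)


def total_loss(scene_loss, view_loss, lam):
    """Combined loss: scene loss plus lambda times the single-view loss."""
    return scene_loss + lam * view_loss


# Hypothetical sizes: 3 views, 4 feature channels, a 5x6 ground-plane grid.
rng = np.random.default_rng(0)
num_views, channels, height, width = 3, 4, 5, 6
features = rng.normal(size=(num_views, channels, height, width))
scores = rng.normal(size=(num_views, height, width))  # stand-in for the subnet output

weights = view_weights(scores)
fused = fuse_views(features, weights)
```

Because the weights in this sketch sum to 1 at every ground-plane location, fusing identical per-view features returns those features unchanged. A view with a low score at a location contributes little to the fused feature there.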