3D Crowd Counting via Geometric Attention-guided Multi-View Fusion

Qi Zhang, Antoni B. Chan

TL;DR

This work considers the variable height of people in the 3D world and proposes to solve the multi-view crowd counting task through 3D feature fusion with 3D scene-level density maps, instead of a 2D density map on the ground plane.

Abstract

Recently, multi-view crowd counting using deep neural networks has been proposed to enable counting in large and wide scenes using multiple cameras. Current methods project the camera-view features to the average-height plane of the 3D world, and then fuse the projected multi-view features to predict a 2D scene-level density map on the ground (i.e., a bird's-eye view). Unlike previous research, we consider the variable height of people in the 3D world and propose to solve the multi-view crowd counting task through 3D feature fusion with 3D scene-level density maps, instead of the 2D density map on the ground plane. Compared to 2D fusion, 3D fusion extracts more information about people along the z-dimension (height), which helps to address the scale variations across multiple views. The 3D density maps still preserve the property of 2D density maps that the sum is the count, while also providing 3D information about the crowd density. Furthermore, instead of using the standard method of copying the features along the view ray in the 2D-to-3D projection, we propose an attention module based on a height estimation network, which forces each 2D pixel to be projected to one 3D voxel along the view ray. We also explore the projection consistency between the 3D prediction and the ground truth in the 2D views to further enhance the counting performance. The proposed method is tested on synthetic and real-world multi-view counting datasets and achieves better or comparable counting performance to the state of the art.
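
The count-preserving property of 3D density maps mentioned in the abstract can be illustrated with a small sketch. It is not the authors' ground-truth generation code: the function name `gaussian_density_3d`, the grid size, the kernel width `sigma`, and the per-person normalisation are all assumptions chosen for illustration. Each person is represented by a unit-mass 3D Gaussian, so the voxel sum equals the count, and summing along the height axis yields a 2D ground-plane map with the same total.

```python
import numpy as np

def gaussian_density_3d(points, shape, sigma=1.0):
    """Build a 3D density map with one unit-mass Gaussian per point.

    points: iterable of (x, y, z) voxel coordinates (hypothetical annotations).
    shape:  (Z, Y, X) size of the voxel grid.
    """
    zz, yy, xx = np.meshgrid(
        np.arange(shape[0]), np.arange(shape[1]), np.arange(shape[2]),
        indexing="ij",
    )
    density = np.zeros(shape, dtype=np.float64)
    for x, y, z in points:
        blob = np.exp(-((xx - x) ** 2 + (yy - y) ** 2 + (zz - z) ** 2)
                      / (2.0 * sigma ** 2))
        # Assumption: normalise on the finite grid so each person sums to 1.
        density += blob / blob.sum()
    return density

# Three hypothetical people at different heights (z) in a 6 x 16 x 16 grid.
points = [(4, 5, 2), (10, 3, 3), (7, 12, 2)]
density_3d = gaussian_density_3d(points, shape=(6, 16, 16))

# Collapsing the height axis gives a 2D ground-plane density map.
density_2d = density_3d.sum(axis=0)

print("3D sum:", density_3d.sum())
print("2D sum:", density_2d.sum())
```

Both sums equal the number of annotated points up to floating-point error, while the 3D map additionally records where along the height axis the density lies.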

Paper Structure

This paper contains 26 sections, 10 equations, 10 figures, 10 tables.

Figures (10)

  • Figure 1: An example for the limitation of single-view counting for large and wide scenes: limited field-of-view, low resolution and severe occlusion. This image is from the ShanghaiTech dataset zhang2015cross.
  • Figure 2: The pipeline of 3D crowd counting. Single-view features are extracted and then projected to the 3D world on multiple height planes. The projected 3D features are concatenated and fused to output the 3D density map prediction (loss $l_{3d}$). Each camera-view prediction branch decodes the 2D features to obtain the 2D camera-view predictions (loss $l_{2d}$). Finally, the 3D prediction is back-projected to each camera view, and the projection consistency between the camera-view ground truth and the back-projected prediction is measured (loss $l_{3d\_2d}$).
  • Figure 3: The 2D-3D projection process. (a) The previous projection forms the 3D grid by copying the 2D features along the view ray to every 3D voxel the ray intersects; (b) the proposed geometric attention-guided 2D-3D projection forces each 2D feature to be projected to one 3D voxel, while the other voxels are suppressed. On the right, lighter colors indicate a lower probability that the 3D voxel comes from the corresponding 2D feature along the view ray.
  • Figure 4: The multi-height projection can extract features of a person along the $z$ dimension and form a 3D feature representation for the person, which is consistent with the 3D scene.
  • Figure 5: The geometric attention module. Each pixel in the camera-view image is classified into one of $N$ height levels, giving $N$ height probability maps. These maps are projected to 3D scene planes according to their height levels (a set of 2D-2D projections for different height levels). The projected height maps are used as attention to reduce the repeated features in the 3D counting feature via multiplication.
  • ...and 5 more figures
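
The attention-by-multiplication idea in the Figure 3 and Figure 5 captions can be sketched as follows. This is a simplified illustration under stated assumptions, not the paper's implementation: the camera-to-plane projection at each height level is omitted (treated as identity), the per-pixel height probabilities are assumed to come from a softmax over hypothetical `height_logits`, and the function name `attention_guided_lift` is made up for this example.

```python
import numpy as np

def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention_guided_lift(features, height_logits):
    """Lift 2D features onto N height planes, weighted by height attention.

    features:      (C, H, W) camera-view feature map.
    height_logits: (N, H, W) per-pixel scores for N height levels
                   (assumed output of a height estimation network).
    Returns a (C, N, H, W) volume. The per-height geometric projection
    is omitted here, which is a simplification of this sketch.
    """
    attention = softmax(height_logits, axis=0)              # (N, H, W)
    # Plain copying along the view ray: the same feature on every plane.
    copied = np.repeat(features[:, None], attention.shape[0], axis=1)
    # Attention suppresses planes that are unlikely for each pixel.
    return copied * attention[None]

rng = np.random.default_rng(0)
features = rng.normal(size=(8, 4, 5))       # C=8 channels, 4x5 pixels
height_logits = rng.normal(size=(3, 4, 5))  # N=3 height levels
lifted = attention_guided_lift(features, height_logits)

print(lifted.shape)
```

Because the attention weights sum to one over the height levels, summing `lifted` over the plane axis recovers the original feature map, whereas plain copying would repeat each feature N times. As the height probabilities become sharper, each pixel's feature concentrates on a single plane, which is the behaviour the captions describe.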