
CountFormer: Multi-View Crowd Counting Transformer

Hong Mo, Xiong Zhang, Jianchao Tan, Cheng Yang, Qiong Gu, Bo Hang, Wenqi Ren

TL;DR

CountFormer introduces a 3D multi-view counting (MVC) framework that lifts image-level features from multiple synchronized views into a scene-level 3D volume using deformable attention-based feature lifting and camera-parameter embeddings. It combines Cross-View Attention-based lifting, multi-view (MV) volume aggregation, and a 3D FPN-based density predictor to estimate dense 3D crowd density without requiring fixed camera layouts. The method achieves state-of-the-art or competitive results on CityStreet, PETS2009, CVCS, and DukeMTMC, and is robust to arbitrary dynamic camera configurations. By removing flat-ground and fixed-layout constraints, and by addressing practical considerations such as annotation requirements and efficiency, CountFormer offers a scalable solution for real-world MVC applications.

Abstract

Multi-view counting (MVC) methods have shown their superiority over single-view counterparts, particularly under heavy occlusion and severe perspective distortion. However, the hand-crafted heuristic features and identical camera layout requirements of conventional MVC methods limit their applicability and scalability in real-world scenarios. In this work, we propose a concise 3D MVC framework called CountFormer that elevates multi-view image-level features to a scene-level volume representation and estimates the 3D density map from the volume features. By incorporating a camera encoding strategy, CountFormer embeds camera parameters into the volume query and the image-level features, enabling it to handle camera layouts that differ significantly from one another. Furthermore, we introduce a feature lifting module built on the attention mechanism to transform image-level features into a 3D volume representation for each camera view. Subsequently, the multi-view volume aggregation module attentively aggregates the per-view volumes into a comprehensive scene-level volume representation, allowing CountFormer to handle images captured by arbitrary dynamic camera layouts. The proposed method performs favorably against state-of-the-art approaches across various widely used datasets, demonstrating its greater suitability for real-world deployment compared to conventional MVC frameworks.
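
To make the role of camera parameters concrete, the following is a minimal sketch of the standard pinhole projection that relates a 3D voxel center to a pixel location in one view. It is not the authors' lifting module, which the abstract describes as attention-based with learned camera embeddings; the function name `project_point`, the zero-skew intrinsics, and the toy camera values are assumptions made only for illustration.

```python
from typing import List, Optional, Sequence, Tuple

Matrix3 = List[List[float]]


def project_point(
    point_world: Sequence[float],
    K: Matrix3,
    R: Matrix3,
    t: Sequence[float],
    image_size: Tuple[int, int],
) -> Optional[Tuple[float, float]]:
    """Project a 3D world point to pixel coordinates (hypothetical helper).

    K is a 3x3 intrinsic matrix (skew assumed zero), R and t are the
    world-to-camera rotation and translation, image_size is (width, height).
    Returns None if the point is behind the camera or outside the image.
    """
    # World frame -> camera frame: X_c = R @ X_w + t
    xc = [sum(R[i][j] * point_world[j] for j in range(3)) + t[i] for i in range(3)]
    if xc[2] <= 0:
        return None  # behind the camera, not visible in this view

    # Camera frame -> pixel coordinates via perspective division
    u = K[0][0] * xc[0] / xc[2] + K[0][2]
    v = K[1][1] * xc[1] / xc[2] + K[1][2]

    width, height = image_size
    if not (0 <= u < width and 0 <= v < height):
        return None  # projects outside the image bounds
    return (u, v)


# Toy camera: identity pose, focal length 100 px, principal point (50, 50).
K = [[100.0, 0.0, 50.0], [0.0, 100.0, 50.0], [0.0, 0.0, 1.0]]
R = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
t = [0.0, 0.0, 0.0]

print(project_point((0.0, 0.0, 2.0), K, R, t, (200, 200)))
print(project_point((0.0, 0.0, -1.0), K, R, t, (200, 200)))
```

A voxel that projects to `None` in a given view contributes nothing from that camera, which is one reason a multi-view aggregation step is needed when camera layouts vary.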

Paper Structure

This paper contains 13 sections, 8 equations, 4 figures, and 5 tables.

Figures (4)

  • Figure 1: Framework of CountFormer. The Image Encoder extracts multi-view, multi-level (MVML) features from the multi-view images of the scene. The Image-Level Camera Embedding Module fuses the camera intrinsics and extrinsics with the MVML features. The Cross-View Attention Module transforms the image-level features into scene-level volume representations. Besides these main components, a 2D Density Predictor estimates the image-space density, 3D Density Predictors regress the 3D scene-level density, and a simple feature pyramid network fuses the multi-scale voxel features.
  • Figure 2: Qualitative Results. The figure shows several typical scenarios from the CityStreet (3 views) and PETS2009 (3 views) datasets, including occlusion and congested crowds. For each sample, the multi-view images are shown together with bird's-eye-view density maps: the ground-truth scene-level density and the densities estimated by the CVCS method [zhang2021cross], the 3D Counting approach [zhang20203d], and CountFormer.
  • Figure 3: Qualitative Results. The figure visualizes 3 challenging scenarios from the CVCS benchmark. For each sample, the visualization includes the multi-view images (5 views), the ground-truth density, the density obtained with the MV volume aggregation module, and the density estimated without this module.
  • Figure 4: Comparisons with state-of-the-art (SOTA) methods. The figure compares [zhang2019wide, zhang2022wide, zhang20203d, zhang20223d, zhang2021cross, zheng2021learning, qiu2019cross, zhang2022calibration] with our CountFormer, using the mean absolute error (MAE $\downarrow$) to evaluate performance on the CityStreet dataset [zhang2019wide], the CVCS dataset [zhang2021cross], the PETS2009 dataset [ferryman2009pets2009], and the DukeMTMC dataset [ristani2016performance]. For better visualization, we plot the best performance among [zhang2019wide, zhang2022wide, zhang20203d, zhang20223d, zhang2021cross, zheng2021learning, qiu2019cross, zhang2022calibration] against ours on each dataset.
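
The MAE metric referenced in Figure 4 can be sketched as follows, assuming the usual density-map convention that a scene's count is the sum of its density values and that MAE averages the absolute count error over test samples. The helper names `count_from_density` and `mean_absolute_error` and the toy numbers are illustrative assumptions, not the paper's evaluation code or results.

```python
from typing import Sequence


def count_from_density(density: Sequence[Sequence[float]]) -> float:
    """Sum a 2D density map (rows of floats) to obtain the estimated count."""
    return sum(sum(row) for row in density)


def mean_absolute_error(
    pred_counts: Sequence[float], gt_counts: Sequence[float]
) -> float:
    """Average absolute difference between predicted and ground-truth counts."""
    if len(pred_counts) != len(gt_counts) or len(pred_counts) == 0:
        raise ValueError("count lists must be non-empty and of equal length")
    total = sum(abs(p - g) for p, g in zip(pred_counts, gt_counts))
    return total / len(pred_counts)


# Toy example with made-up values (not taken from any dataset).
toy_density = [[0.5, 0.5], [1.0, 0.0]]
print(count_from_density(toy_density))
print(mean_absolute_error([10.0, 20.0], [12.0, 17.0]))
```

Lower MAE is better, which is what the downward arrow in the caption indicates.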