Improving Point-based Crowd Counting and Localization Based on Auxiliary Point Guidance

I-Hsiang Chen, Wei-Ting Chen, Yu-Wei Liu, Ming-Hsuan Yang, Sy-Yen Kuo

TL;DR

This work tackles instability in the learning signal of point-based crowd counting and localization caused by unstable proposal-target matching. It introduces Auxiliary Point Guidance (APG) to provide explicit positive/negative guidance and Implicit Feature Interpolation (IFI) to enable accurate feature extraction at arbitrary positions, forming the APGCC framework. By integrating APG and IFI into a VGG-16/ASPP backbone and a joint objective that combines the standard point-based loss with APG losses, the approach achieves state-of-the-art results across multiple counting and localization benchmarks, including SHHA, SHHB, UCF-QNRF, JHU-Crowd, UCF_CC_50, and NWPU-Crowd. Ablation studies confirm that APG and IFI contribute independently and synergistically, while maintaining efficient inference; the authors also emphasize practical impact through improved robustness across densities and environments and plan public release of code and models.

Abstract

Crowd counting and localization have become increasingly important in computer vision due to their wide-ranging applications. While point-based strategies have been widely used in crowd counting methods, they face a significant challenge, i.e., the lack of an effective learning strategy to guide the matching process. This deficiency leads to instability in matching point proposals to target points, adversely affecting overall performance. To address this issue, we introduce an effective approach to stabilize the proposal-target matching in point-based methods. We propose Auxiliary Point Guidance (APG) to provide clear and effective guidance for proposal selection and optimization, addressing the core issue of matching uncertainty. Additionally, we develop Implicit Feature Interpolation (IFI) to enable adaptive feature extraction in diverse crowd scenarios, further enhancing the model's robustness and accuracy. Extensive experiments demonstrate the effectiveness of our approach, showing significant improvements in crowd counting and localization performance, particularly under challenging conditions. The source codes and trained models will be made publicly available.

Paper Structure (17 sections, 9 equations, 7 figures, 10 tables)

Figures (7)

  • Figure 1: (Left) Crowd Counting and Localization: Comparison with state-of-the-art methods (e.g., LSC-CNN sam2020locate, TopoCount abousamra2021localization, P2PNet song2021rethinking and CLTR liang2022end), demonstrating the effectiveness of the proposed APGCC in accurately counting and localizing people in crowded scenes. (Right) Matching Process Instability: Existing point-based methods (e.g., Matcher kuhn1955hungarian) select point proposals inconsistently across training epochs during the matching process. This is quantified by the Instability Rate (IR), which measures the inconsistency rate of point proposal selection per epoch; the instability limits performance. Both evaluations are conducted on the ShanghaiTech A (SHHA) zhang2016single dataset.
  • Figure 2: Illustration of the Auxiliary Point Guidance framework. During training, we additionally introduce auxiliary positive ($A_{\text{pos}}$) and negative ($A_{\text{neg}}$) points based on each ground truth position to guide the network's learning. This approach directs the optimization process more effectively by distinguishing between potential positive and negative matches.
  • Figure 2: Evaluation of crowd counting on UCF_CC_50 idrees2013multi dataset.
  • Figure 3: Illustration of Implicit Feature Interpolation. Given an arbitrary desired point position $(x, y)$, we concatenate the nearest four feature maps ($Z^{\star}_1$ to $Z^{\star}_4$) along with their distances ($\delta^{\star}_1$ to $\delta^{\star}_4$) to $(x, y)$, encoded with positional encoding $\phi$, and use a Multi-Layer Perceptron (MLP) $f_\theta$ to interpolate the latent feature for that specific location. This approach enables precise feature extraction at non-grid locations, facilitating more flexible and accurate feature representation.
  • Figure 4: Illustration of the proposed APGCC for crowd counting and localization. A VGG encoder extracts image features, where features from conv3 and conv4 layers undergo refinement via Atrous Spatial Pyramid Pooling florian2017rethinking. Subsequently, target latent features are interpolated using implicit feature interpolation. These latent features are then processed through a prediction head to obtain confidence score $\hat{c}$ and offsets $(\Delta_x, \Delta_y)$, facilitating precise crowd counting and localization.
  • ...and 2 more figures
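
The Implicit Feature Interpolation step described in the Figure 3 caption can be sketched roughly as follows. This is a minimal NumPy illustration under stated assumptions, not the authors' implementation: the function names (`positional_encoding`, `four_neighbours`, `make_mlp`, `implicit_feature_interpolation`), the sin/cos form of $\phi$, the number of frequencies, the two-layer MLP with random weights, and the use of feature-grid coordinates for $(x, y)$ are all hypothetical choices made for the example.

```python
import numpy as np

def positional_encoding(delta, num_freqs=4):
    """Assumed form of phi: sin/cos of a 2-D offset at octave-spaced frequencies."""
    freqs = (2.0 ** np.arange(num_freqs)) * np.pi      # shape (L,)
    angles = delta[:, None] * freqs[None, :]           # shape (2, L)
    return np.concatenate([np.sin(angles).ravel(), np.cos(angles).ravel()])

def four_neighbours(x, y, height, width):
    """Grid positions (column, row) of the four cells surrounding (x, y)."""
    x0 = int(np.clip(np.floor(x), 0, width - 2))
    y0 = int(np.clip(np.floor(y), 0, height - 2))
    return [(x0, y0), (x0 + 1, y0), (x0, y0 + 1), (x0 + 1, y0 + 1)]

def make_mlp(in_dim, hidden_dim, out_dim, seed=0):
    """Stand-in for f_theta: a two-layer ReLU MLP with random (untrained) weights."""
    rng = np.random.default_rng(seed)
    w1 = rng.normal(0.0, 0.1, size=(in_dim, hidden_dim))
    b1 = np.zeros(hidden_dim)
    w2 = rng.normal(0.0, 0.1, size=(hidden_dim, out_dim))
    b2 = np.zeros(out_dim)

    def mlp(v):
        hidden = np.maximum(v @ w1 + b1, 0.0)
        return hidden @ w2 + b2

    return mlp

def implicit_feature_interpolation(feat, x, y, mlp, num_freqs=4):
    """Interpolate a latent feature at a continuous position (x, y).

    feat has shape (H, W, C); (x, y) are given in feature-grid coordinates
    (an assumption of this sketch).
    """
    height, width, _ = feat.shape
    parts = []
    for gx, gy in four_neighbours(x, y, height, width):
        delta = np.array([x - gx, y - gy], dtype=np.float64)  # offset to the query
        parts.append(feat[gy, gx])                            # neighbouring feature Z_i
        parts.append(positional_encoding(delta, num_freqs))   # phi(delta_i)
    return mlp(np.concatenate(parts))

# Usage: an 8-channel 6x7 feature map queried at a non-grid location.
channels, num_freqs = 8, 4
feature_map = np.random.default_rng(1).normal(size=(6, 7, channels))
f_theta = make_mlp(4 * (channels + 4 * num_freqs), hidden_dim=32, out_dim=16)
latent = implicit_feature_interpolation(feature_map, 2.3, 4.6, f_theta, num_freqs)
print(latent.shape)
```

Compared with fixed bilinear weights, feeding the encoded offsets to a learned MLP lets the network decide how the four neighbours are blended, which is the flexibility the caption attributes to IFI. In a real model the MLP weights would be trained jointly with the rest of the network rather than drawn at random.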