Robust Zero-Shot Crowd Counting and Localization With Adaptive Resolution SAM

Jia Wan, Qiangqiang Wu, Wei Lin, Antoni B. Chan

TL;DR

This work tackles unsupervised crowd counting in dense scenes by leveraging AdaSEEM, an adaptive-resolution extension of SEEM, to generate high-quality masks and address scale and occlusion challenges. It couples this with a Gaussian Mixture Model–based head localization to produce reliable point pseudo-labels and a robust loss that combines mask and point supervision while excluding uncertain regions, enabling effective density-map regression without manual labels. An iterative pseudo-label generation loop refines segmentation masks by using predictions from a well-trained counting network, yielding progressively better pseudo-labels and improved counts and localizations. Evaluations across multiple datasets show state-of-the-art unsupervised performance and competitive results with some supervised methods, highlighting the method’s potential for labeling-free crowd analytics and its applicability to real-world scenarios where annotations are scarce or unavailable.

Abstract

Existing crowd counting models require extensive training data, which is time-consuming to annotate. To tackle this issue, we propose a simple yet effective crowd counting method that uses the Segment-Everything-Everywhere Model (SEEM), an adaptation of the Segment Anything Model (SAM), to generate pseudo-labels for training crowd counting models. However, our initial investigation reveals that SEEM's performance in dense crowd scenes is limited, primarily because it misses many persons in high-density areas. To overcome this limitation, we propose an adaptive resolution SEEM to handle the scale variations, occlusions, and overlapping of people within crowd scenes. Alongside this, we introduce a robust localization method, based on Gaussian Mixture Models, for predicting the head positions in the predicted people masks. Given the mask and point pseudo-labels, we propose a robust loss function that excludes uncertain regions based on SEEM's predictions, thereby improving the training of the counting networks. Finally, we propose an iterative method for generating pseudo-labels. This method improves the quality of the segmentation masks by identifying more tiny persons in high-density regions, which are often missed in the first pseudo-labeling stage. Overall, our proposed method achieves the best unsupervised performance in crowd counting, while also achieving results comparable to some supervised methods. This makes it an effective and versatile tool for crowd counting, especially in situations where labeled data is not available.
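The idea of a loss that excludes uncertain regions can be illustrated with a minimal sketch. The abstract does not give the form of the paper's robust loss, so the plain mean squared error below, the function name `masked_mse`, and the boolean `uncertain` grid are all assumptions made for illustration; the sketch only shows how flagged pixels can be left out of a regression objective.

```python
from typing import List

Grid = List[List[float]]


def masked_mse(pred: Grid, target: Grid, uncertain: List[List[bool]]) -> float:
    """Mean squared error over pixels that are NOT flagged as uncertain.

    Hypothetical helper: the paper's actual robust loss is not specified here.
    """
    total = 0.0
    kept = 0
    for p_row, t_row, u_row in zip(pred, target, uncertain):
        for p, t, u in zip(p_row, t_row, u_row):
            if u:
                # Uncertain pixel: contributes nothing to the loss.
                continue
            total += (p - t) ** 2
            kept += 1
    # With no certain pixels there is nothing to supervise.
    return total / kept if kept else 0.0


# Toy 2x2 density maps; the bottom-right pixel is marked uncertain.
pred = [[0.2, 0.9], [0.0, 0.5]]
target = [[0.0, 1.0], [0.0, 0.0]]
uncertain = [[False, False], [False, True]]
all_certain = [[False, False], [False, False]]

masked = masked_mse(pred, target, uncertain)
unmasked = masked_mse(pred, target, all_certain)
```

In this toy case the large error at the uncertain pixel is ignored, so `masked` is smaller than `unmasked`; a noisy pseudo-label in that region would therefore not pull the counting network toward it.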

Paper Structure

This paper contains 15 sections, 5 equations, 12 figures, 3 tables, 1 algorithm.

Figures (12)

  • Figure 1: The motivation for our proposed method lies in accurately detecting individuals in high-density areas, where they are often missed due to occlusion and overlapping. Our approach includes zooming into these crowded regions, as this increased resolution helps in identifying previously undetected individuals. For consistency, all regions are resized to $512\times512$ pixels before segmentation.
  • Figure 2: Our framework for unsupervised crowd counting. First, we generate person mask pseudo-labels using an adaptive resolution SAM (AdaSEEM) to enhance the segmentation of small-sized objects in crowd images. We then predict point pseudo-labels via a robust method for head localization achieved by modeling the soft mask distribution using a Gaussian Mixture Model (GMM). The next phase involves training a counting network using a robust loss function that is specifically designed to use the generated mask/point pseudo labels. Finally, we employ an iterative process to generate additional pseudo labels by leveraging the predictions of the trained counting network.
  • Figure 3: Masks generated by different methods. From left to right: SEEM, adaptive resolution SEEM (AdaSEEM), and AdaSEEM + Iter. 0 predictions. In (c), the new pseudo-label masks are highlighted with blue ellipses.
  • Figure 4: Comparison of different methods across density levels of the ShanghaiTech A dataset: low-density (count $\leq$ 300), medium-density (300 $<$ count $\leq$ 600), and high-density (count $>$ 600).
  • Figure 5: Visualization of the predicted density maps. Note that unsupervised methods typically lack the capability to predict such density maps, e.g., CrowdCLIP (Liang et al., 2023) only predicts the count.
  • ...and 7 more figures
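
The density levels in the Figure 4 caption can be written as a small helper. This is a sketch rather than code from the paper: the function name `density_level` is hypothetical, and the medium range is read as 300 < count ≤ 600, the only reading that lies between the low and high bins.

```python
def density_level(count: int) -> str:
    """Map an image's person count to the density bins of Figure 4."""
    if count <= 300:
        return "low"
    if count <= 600:
        # Reached only when count > 300, so this is 300 < count <= 600.
        return "medium"
    return "high"


# Boundary counts fall into the lower of the two adjacent bins.
levels = [density_level(c) for c in (300, 301, 600, 601)]
```

Grouping test images with such a helper before computing the error is one way to report per-density results like those compared in Figure 4.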