Semi-Supervised Crowd Counting with Contextual Modeling: Facilitating Holistic Understanding of Crowd Scenes

Yifei Qian, Xiaopeng Hong, Zhongliang Guo, Ognjen Arandjelović, Carl R. Donovan

TL;DR

This work tackles the annotation bottleneck in crowd counting by introducing MRC-Crowd, a mean-teacher semi-supervised framework that masks patches of unlabeled images so that the student must infer counts from holistic scene cues, supervised by the predictions of an exponential-moving-average (EMA) teacher on the unmasked views. It augments regression with a lightweight two-layer density-classification head to better capture density relationships and manifold structure, enabling robust learning from unlabeled data. Across ShanghaiTech A/B, UCF-QNRF, and JHU-Crowd++, MRC-Crowd reports state-of-the-art results, with notable gains at low labeling ratios, and the framework generalizes to other counting models. The approach requires only a plug-in density-classification head and a standard EMA-based teacher, which makes it straightforward to adopt in real-world counting tasks.

Abstract

To alleviate the heavy annotation burden of training a reliable crowd counting model, and thus make the model more practicable and accurate by letting it benefit from more data, this paper presents a new semi-supervised method based on the mean teacher framework. When labeled data is scarce, the model is prone to overfitting to local patches. In such settings, the conventional approach of using unlabeled data solely to improve the accuracy of local patch predictions proves inadequate. Consequently, we propose a more nuanced approach: fostering the model's intrinsic 'subitizing' capability. This ability allows the model to accurately estimate the count in a region by leveraging its understanding of the crowd scene, mirroring the human cognitive process. To achieve this goal, we apply masking to unlabeled data, guiding the model to make predictions for the masked patches based on holistic cues. Furthermore, to help with feature learning, we incorporate a fine-grained density classification task. Our method is general and applicable to most existing crowd counting methods, as it imposes no strict structural or loss constraints. In addition, we observe that the model trained with our framework exhibits 'subitizing'-like behavior: it accurately predicts low-density regions with only a 'glance', while incorporating local details to predict high-density regions. Our method achieves state-of-the-art performance, surpassing previous approaches by a large margin on challenging benchmarks such as ShanghaiTech A and UCF-QNRF. The code is available at: https://github.com/cha15yq/MRC-Crowd.
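
The masking step mentioned in the abstract can be illustrated with a minimal sketch in plain Python. It assumes non-overlapping square patches and uses the patch size (32) and masking ratio (0.3) quoted in the Figure 3 caption. The helper names `random_patch_mask` and `apply_mask`, the zero fill value, and the rounding of the number of masked patches are assumptions made for illustration; they are not taken from the authors' repository.

```python
import random


def random_patch_mask(height, width, patch_size=32, mask_ratio=0.3, seed=None):
    """Build a height x width grid of flags: 1 keeps a pixel, 0 masks it.

    The image is split into non-overlapping patch_size x patch_size patches
    and round(mask_ratio * number_of_patches) of them are masked.
    """
    if height % patch_size or width % patch_size:
        raise ValueError("height and width must be multiples of patch_size")
    rows, cols = height // patch_size, width // patch_size
    num_masked = round(mask_ratio * rows * cols)
    rng = random.Random(seed)
    # Pick the flat indices of the patches to hide, without replacement.
    masked = set(rng.sample(range(rows * cols), num_masked))
    return [
        [0 if (y // patch_size) * cols + x // patch_size in masked else 1
         for x in range(width)]
        for y in range(height)
    ]


def apply_mask(image, mask, fill=0.0):
    """Replace masked pixels of a 2-D image (list of rows) with `fill`.

    The fill value is an assumption; the source does not state what the
    masked patches are filled with.
    """
    return [
        [pixel if keep else fill for pixel, keep in zip(image_row, mask_row)]
        for image_row, mask_row in zip(image, mask)
    ]


# Usage: a 128 x 128 single-channel image has 16 patches of size 32,
# so round(0.3 * 16) = 5 patches are masked.
image = [[1.0] * 128 for _ in range(128)]
mask = random_patch_mask(128, 128, patch_size=32, mask_ratio=0.3, seed=0)
student_input = apply_mask(image, mask)
```

In this sketch the student would receive `student_input`, while the teacher would receive the unmasked `image`, so the student can only match the teacher on masked patches by using the surrounding context.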

Paper Structure (24 sections, 9 equations, 6 figures, 7 tables)


Figures (6)

  • Figure 1: Illustration of how a model trained with limited labeled data relies excessively on the information within individual patches. On the far left is the input image, in which a region filled with Gaussian noise is highlighted by a red rectangle. We show the predicted density maps from two models: (a) Ours, the model trained with our proposed method under the 40% labeled setting; (b) Sup Only, the same model trained under the same setting but with labeled data only. The count within the highlighted region is indicated at the bottom right of each image. Both models share the same network structure, detailed in Section III as the 'base network'.
  • Figure 2: (a) The base network structure adopted in MRC-Crowd. It contains a backbone network, a top-down multi-level feature fusion module, a regression head, and a classification head. We adopt VGG-19 as the backbone network. (b) The overall framework of the proposed MRC-Crowd. The labeled data is used to train the student model by optimizing $\mathcal{L}^{s}_{reg}$ and $\mathcal{L}^{s}_{cls}$. The teacher model is updated with the exponential moving average of the student model's weights. The unlabeled data with strong perturbation is fed to the student model, while the supervision signals are provided by the teacher model's predictions on the same data without strong perturbation. Both the regression task and the classification task are supervised in this way. The unsupervised learning process is optimized with $\mathcal{L}^{u}$.
  • Figure 3: Left: an example that has undergone strong augmentation (masked patch size $=32$, masking ratio $=0.3$). Right: the original image.
  • Figure 4: Visualizations of results on the JHU-Crowd++ test set. The first row shows the input images, the second row the output of the base model trained solely with labeled data, and the third row the output of our proposed MRC-Crowd. Both models are trained with a labeled ratio of 5%.
  • Figure 5: Impact of progressively blurring patches within images on the performance of models trained exclusively with labeled data ("Sup Only") versus our proposed framework ("Ours").
  • ...and 1 more figure
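
The teacher update described in the Figure 2 caption (the teacher's weights are an exponential moving average of the student's weights) can be sketched as follows. This is a minimal illustration in plain Python with weights stored as a dictionary of floats. The function name `ema_update` and the decay values are assumptions, since the source does not state the decay the authors use, and a real implementation would apply the same arithmetic to model tensors.

```python
def ema_update(teacher, student, decay=0.99):
    """Update teacher weights in place toward the student weights.

    Both arguments map parameter names to floats. For each weight:
        teacher = decay * teacher + (1 - decay) * student
    """
    for name, student_value in student.items():
        teacher[name] = decay * teacher[name] + (1.0 - decay) * student_value
    return teacher


# Usage: the teacher starts as a copy of the student and then lags behind it.
student = {"w": 0.0}
teacher = dict(student)
for step in range(3):
    student["w"] += 1.0  # stand-in for one gradient step on the student
    ema_update(teacher, student, decay=0.9)  # hypothetical decay value
```

Because the teacher changes more slowly than the student, its predictions on the unperturbed images give a steadier supervision signal for the student's predictions on the perturbed images.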