Point-to-Region Loss for Semi-Supervised Point-Based Crowd Counting

Wei Lin, Chenyang Zhao, Antoni B. Chan

TL;DR

This work addresses the labor-intensive annotation required for point-based crowd counting by introducing a semi-supervised learning framework that leverages pseudo-labels. Using a point-specific activation map (PSAM), a gradient-based visualization, the authors show that background regions fail to receive useful supervision under point-to-point (P2P) matching. This motivates a shift to point-to-region (P2R) matching, which propagates pseudo-label confidence to local regions. The proposed P2R loss eliminates the need for computationally heavy Hungarian matching while enabling effective training with limited labeled data and abundant unlabeled data, achieving strong results in semi-supervised counting and unsupervised domain adaptation. Empirical results on multiple datasets show that P2R outperforms or matches state-of-the-art methods, with substantial gains in efficiency and robustness, and the authors provide code for reproducibility.

Abstract

Point detection has been developed to locate pedestrians in crowded scenes by training a counter through a point-to-point (P2P) supervision scheme. Despite its excellent localization and counting performance, training a point-based counter still faces challenges concerning annotation labor: hundreds to thousands of points are required to annotate a single sample capturing a dense crowd. In this paper, we integrate point-based methods into a semi-supervised counting framework based on pseudo-labeling, enabling the training of a counter with only a few annotated samples supplemented by a large volume of pseudo-labeled data. However, during implementation, the training encounters issues because the confidence for pseudo-labels fails to be propagated to background pixels via P2P matching. To tackle this challenge, we devise a point-specific activation map (PSAM) to visually interpret the phenomena occurring during the ill-posed training. Observations from the PSAM suggest that the feature map is excessively activated by the loss for unlabeled data, causing the decoder to misinterpret these over-activations as pedestrians. To mitigate this issue, we propose a point-to-region (P2R) scheme to substitute for P2P, which segments out a local region for each pedestrian rather than detecting a single corresponding point for supervision. Consequently, pixels in the local region can share the same confidence as the corresponding pseudo points. Experimental results in both semi-supervised counting and unsupervised domain adaptation highlight the advantages of our method, illustrating that P2R can resolve the issues identified in PSAM. The code is available at https://github.com/Elin24/P2RLoss.
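
The idea that pixels in a local region share the confidence of their pseudo point can be illustrated with a small NumPy sketch. This is not the authors' implementation (see the repository linked above for that). The function name `point_to_region_targets`, the fixed `radius`, and the choice of the pixel closest to each point as foreground are all assumptions made for illustration. Giving zero weight to pixels outside every region is also a simplifying assumption.

```python
import numpy as np

def point_to_region_targets(points, confidences, height, width, radius):
    """Illustrative point-to-region assignment (hypothetical helper).

    points:      (N, 2) pseudo-point coordinates as (row, col).
    confidences: (N,) pseudo-label confidences in [0, 1].
    radius:      assumed region radius in pixels.
    Returns (target, weight), each of shape (height, width).
    """
    points = np.asarray(points, dtype=float).reshape(-1, 2)
    confidences = np.asarray(confidences, dtype=float).reshape(-1)
    target = np.zeros((height, width), dtype=float)
    weight = np.zeros((height, width), dtype=float)
    if len(points) == 0:
        return target, weight

    rows, cols = np.mgrid[0:height, 0:width]
    pixels = np.stack([rows.ravel(), cols.ravel()], axis=1).astype(float)

    # Distance from every pixel to every pseudo point: shape (H*W, N).
    dists = np.linalg.norm(pixels[:, None, :] - points[None, :, :], axis=2)

    # Each pixel is assigned to its nearest point; no Hungarian matching.
    nearest = dists.argmin(axis=1)
    nearest_dist = dists[np.arange(len(pixels)), nearest]
    in_region = nearest_dist <= radius

    # Every pixel in a region inherits the confidence of its pseudo point.
    weight_flat = np.where(in_region, confidences[nearest], 0.0)

    # Assumption: one foreground pixel per region, the one closest to the point.
    target_flat = np.zeros(len(pixels))
    for k in range(len(points)):
        members = np.flatnonzero(in_region & (nearest == k))
        if members.size:
            target_flat[members[dists[members, k].argmin()]] = 1.0

    return target_flat.reshape(height, width), weight_flat.reshape(height, width)


# Two pseudo points with different confidences on a 5 x 10 grid.
target, weight = point_to_region_targets(
    points=[(2, 2), (2, 7)], confidences=[0.9, 0.4],
    height=5, width=10, radius=1.5,
)
```

In this sketch the background pixels next to a pseudo point carry that point's confidence as their weight, whereas a purely point-to-point scheme would supervise only the matched pixel itself.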

Paper Structure

This paper contains 26 sections, 27 equations, 11 figures, 4 tables, and 2 algorithms.

Figures (11)

  • Figure 1: The workflow of semi-supervised point-based counting methods. The teacher model generates pseudo labels by extracting the foreground pixels, while the student model takes the corresponding strongly augmented image as input to construct the computation graph. The training loss between the pseudo label and the student's prediction involves two steps: the proposed P2R matching and the weighted cross-entropy computation.
  • Figure 2: The generation process of PSAM.
  • Figure 3: Observations in PSAM. (a) The training process, where model-L and model-U are extracted from the 100th and 200th epochs, respectively. (b) Comparing sorted values of PSAMs in the two models. (c) & (d) Visualizing the average of local PSAM, and (e) & (f) the aggregated PSAM to compare model-L and model-U from a global perspective.
  • Figure 4: Difference between P2P and P2R matching. (a) and (d) demonstrate an overall difference. P2P focuses only on foreground pixels in $\mathbf{P}_s$, while P2R considers background pixels as well. (b) and (e) show how the matching is performed in P2P and P2R, respectively. P2R segments out local regions for each pseudo-label, whereas P2P only detects one point. (c) and (f) illustrate how untrusted predictions are filtered. P2P retains only foreground pixels for loss computation, while P2R also keeps pixels in the neighborhoods of their corresponding pseudo points.
  • Figure 5: Qualitative comparison with other models.
  • ...and 6 more figures
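
The second training step named in the Figure 1 caption, the weighted cross-entropy computation, can be sketched in NumPy as follows. This is a generic confidence-weighted binary cross-entropy, not the authors' exact loss. The function name `weighted_bce`, the normalization by total weight, and the `eps` clipping are assumptions made for illustration.

```python
import numpy as np

def weighted_bce(prob, target, weight, eps=1e-7):
    """Confidence-weighted binary cross-entropy (hypothetical helper).

    prob:   predicted foreground probabilities from the student, in (0, 1).
    target: per-pixel 0/1 labels derived from the teacher's pseudo labels.
    weight: per-pixel confidence; a weight of 0 drops a pixel from the loss.
    """
    prob = np.clip(np.asarray(prob, dtype=float), eps, 1.0 - eps)
    target = np.asarray(target, dtype=float)
    weight = np.asarray(weight, dtype=float)

    # Standard per-pixel binary cross-entropy.
    per_pixel = -(target * np.log(prob) + (1.0 - target) * np.log(1.0 - prob))

    # Assumption: normalize by the total weight so the loss scale does not
    # depend on how many pixels survive the confidence filtering.
    total_weight = weight.sum()
    if total_weight == 0:
        return 0.0
    return float((weight * per_pixel).sum() / total_weight)


# A low-confidence pixel contributes less to the loss than a trusted one.
loss = weighted_bce(
    prob=[[0.9, 0.2]], target=[[1.0, 1.0]], weight=[[1.0, 0.1]],
)
```

Pixels with zero weight have no effect on the result, which is how untrusted predictions would be filtered out in this simplified setting.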