Semi-supervised Counting via Pixel-by-pixel Density Distribution Modelling

Hui Lin, Zhiheng Ma, Rongrong Ji, Yaowei Wang, Zhou Su, Xiaopeng Hong, Deyu Meng

TL;DR

This work tackles semi-supervised crowd counting by modelling each pixel's density as a probability distribution over density intervals rather than a single value, which supports learning from limited labels. It introduces P$^3$Net, which combines a Pixel-wise Distribution Matching (PDM) loss, a Transformer decoder augmented with density tokens, and a dual-branch structure over interleaving density intervals with inter-branch Expectation Consistency Regularization (ECR) to exploit unlabeled data. The method reports state-of-the-art results on multiple benchmarks in semi-supervised settings and remains competitive in the fully supervised setting. The approach is aimed at real-world deployments where labeling is expensive, and is reported to be resilient to adverse conditions and varying crowd distributions.

Abstract

This paper focuses on semi-supervised crowd counting, where only a small portion of the training data is labeled. We formulate the pixel-wise density value to regress as a probability distribution, instead of a single deterministic value. On this basis, we propose a semi-supervised crowd-counting model. First, we design a pixel-wise distribution matching loss to measure the differences in the pixel-wise density distributions between the prediction and the ground truth. Second, we enhance the transformer decoder by using density tokens to specialize the decoder's forward passes with respect to different density intervals. Third, we design an interleaving consistency self-supervised learning mechanism to learn from unlabeled data efficiently. Extensive experiments on four datasets show that our method outperforms the competitors by a large margin under various labeled-ratio settings. Code will be released at https://github.com/LoraLinH/Semi-supervised-Counting-via-Pixel-by-pixel-Density-Distribution-Modelling.
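
To make the central idea concrete, the following is a minimal sketch of reading a pixel's density as a distribution over density intervals instead of a single regressed value. It is not the authors' implementation: the interval edges, the use of interval midpoints as representative values, and the function names are all assumptions chosen for illustration.

```python
import math

# Hypothetical density interval edges (assumption, not taken from the paper).
EDGES = [0.0, 0.05, 0.2, 0.5, 1.0]
# Assumption: each interval is represented by its midpoint.
CENTERS = [(lo + hi) / 2 for lo, hi in zip(EDGES[:-1], EDGES[1:])]


def softmax(logits):
    """Numerically stable softmax over one pixel's interval logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expected_density(logits):
    """Expectation of the per-pixel distribution over interval midpoints."""
    probs = softmax(logits)
    return sum(p * c for p, c in zip(probs, CENTERS))


def expected_count(logit_map):
    """Sum the per-pixel expected densities of a 2-D map of logit vectors."""
    return sum(expected_density(pixel) for row in logit_map for pixel in row)
```

A pixel whose logits are equal across all intervals yields the mean of the midpoints, while a sharply peaked pixel yields roughly the midpoint of its dominant interval. Summing the expectations over the map gives a count estimate under these assumptions.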

Paper Structure (33 sections, 9 equations, 9 figures, 15 tables, 1 algorithm)

Figures (9)

  • Figure 1: The structure of P$^3$Net. (a) The dual-branch structure with density tokens that predicts interleaving density categories. Different token colors represent the different specified density intervals. The softmax operation is applied over the categories, while each predicted distribution map represents the segmentation map learned by a specific token for the corresponding density category. (b) The structure of the decoder. (c) The inter-branch Expectation Consistency Regularization for self-supervised learning. (d) The horizontal axis stands for the density values, with the attached squares for the discrete density intervals corresponding to the tokens. The vertical axis is the normalized distribution value of each category for that pixel.
  • Figure 2: Similar regions of the same density level exist within an image. We use a density token to specify a density interval and group the regions of that level.
  • Figure 3: Visualizations of predicted densities on unlabeled training images of ShanghaiTech A. The first row: input images. The second row: density maps predicted by the SUA model. The third row: density maps predicted by our P$^3$Net. For the SUA model on unlabeled data, serious false alarms are observed in the background, as shown in the second row. In contrast, our density-token-guided model performs more stably and thus produces more accurate density maps, as shown in the third row.
  • Figure 4: SDDS uses a shared decoder with two independent sets of tokens for interleaving intervals.
  • Figure 5: STDS uses two independent decoders with a shared set of tokens for interleaving intervals.
  • ...and 4 more figures
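
The inter-branch consistency idea in Figure 1(c) can be sketched as follows: two branches predict distributions over different, interleaving sets of density intervals, and on unlabeled pixels their expected densities are encouraged to agree. This is only an illustrative sketch. The interval midpoints, the absolute-difference penalty, and all names below are assumptions, and the exact ECR formulation is not given in this summary.

```python
import math

# Hypothetical interval midpoints for two branches (assumption). Branch B's
# intervals are shifted so that its boundaries fall inside branch A's intervals.
CENTERS_A = [0.1, 0.3, 0.5, 0.7]
CENTERS_B = [0.05, 0.2, 0.4, 0.6, 0.75]


def softmax(logits):
    """Numerically stable softmax over one pixel's interval logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def expectation(logits, centers):
    """Expected density of one pixel under a branch's interval midpoints."""
    return sum(p * c for p, c in zip(softmax(logits), centers))


def consistency_loss(logits_a, logits_b):
    """Mean absolute gap between the two branches' per-pixel expectations.

    logits_a and logits_b are lists of per-pixel logit vectors for the same
    pixels, one list per branch. No ground-truth label is used.
    """
    gaps = [
        abs(expectation(a, CENTERS_A) - expectation(b, CENTERS_B))
        for a, b in zip(logits_a, logits_b)
    ]
    return sum(gaps) / len(gaps)
```

Because the two interval sets differ, the branches cannot agree by predicting the same category index. They can only agree through their expectations, which is why the comparison is made on expected densities in this sketch.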