Semi-supervised Counting via Pixel-by-pixel Density Distribution Modelling

Hui Lin; Zhiheng Ma; Rongrong Ji; Yaowei Wang; Zhou Su; Xiaopeng Hong; Deyu Meng

TL;DR

This work tackles semi-supervised crowd counting by reframing each pixel's density as a probability distribution over density intervals, which makes learning more robust when labels are scarce. It introduces P$^3$Net, which combines a Pixel-wise Distribution Matching (PDM) loss, a Transformer decoder augmented with density tokens, and a dual-branch structure with interleaved density intervals and inter-branch Expectation Consistency Regularization (ECR) to exploit unlabeled data. The method yields state-of-the-art results across multiple benchmarks in semi-supervised settings and remains competitive in fully-supervised scenarios. The approach is practical for real-world deployment where labeling is expensive, and it shows resilience to adverse conditions and varying crowd distributions.

Abstract

This paper focuses on semi-supervised crowd counting, where only a small portion of the training data is labeled. We formulate the pixel-wise density value to be regressed as a probability distribution rather than a single deterministic value. On this basis, we propose a semi-supervised crowd-counting model. First, we design a pixel-wise distribution matching loss to measure the difference between the predicted and ground-truth pixel-wise density distributions. Second, we enhance the transformer decoder with density tokens that specialize the decoder's forward passes with respect to different density intervals. Third, we design an interleaving consistency self-supervised learning mechanism to learn efficiently from unlabeled data. Extensive experiments on four datasets show that our method outperforms competing methods by a large margin under various labeled-ratio settings. Code will be released at https://github.com/LoraLinH/Semi-supervised-Counting-via-Pixel-by-pixel-Density-Distribution-Modelling.
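
The idea of predicting a distribution over density intervals for each pixel, rather than a single value, can be illustrated with a small NumPy sketch. Everything specific here is an assumption made for illustration: the interval edges, the function names, and in particular the use of a cumulative-distribution gap as the matching measure, which is a generic stand-in and not necessarily the PDM loss defined in the paper.

```python
import numpy as np

def softmax(logits, axis=-1):
    """Numerically stable softmax over the density-interval axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)

def one_hot_interval(density, bin_edges):
    """Map each ground-truth density value to a one-hot vector over intervals."""
    num_bins = len(bin_edges) - 1
    idx = np.searchsorted(bin_edges, density, side="right") - 1
    idx = np.clip(idx, 0, num_bins - 1)  # out-of-range values fall into the end bins
    return np.eye(num_bins)[idx]

def expected_density(probs, bin_edges):
    """Per-pixel expectation, using interval midpoints as representative values."""
    centers = (bin_edges[:-1] + bin_edges[1:]) / 2.0
    return probs @ centers

def cdf_matching_loss(pred, target, p=1):
    """Illustrative distribution distance: gap between cumulative distributions.

    Assumption: stand-in for a pixel-wise matching loss; p is the norm level.
    """
    gap = np.abs(np.cumsum(pred, axis=-1) - np.cumsum(target, axis=-1))
    return (gap ** p).sum(axis=-1).mean()

# Hypothetical 2x2 density map and three hypothetical density intervals.
bin_edges = np.array([0.0, 0.5, 1.0, 2.0])
gt_density = np.array([[0.2, 0.7], [1.5, 3.0]])

target = one_hot_interval(gt_density, bin_edges)  # shape (2, 2, 3)
pred = softmax(np.zeros((2, 2, 3)))               # uniform prediction per pixel
loss = cdf_matching_loss(pred, target)
count = expected_density(pred, bin_edges).sum()   # predicted count for the image
```

Unlike a plain cross-entropy over intervals, a cumulative comparison of this kind penalizes a prediction more the further its mass lies from the correct interval, which is one reason to compare distributions rather than class labels.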

Paper Structure (33 sections, 9 equations, 9 figures, 15 tables, 1 algorithm)

Introduction
Related Works
Fully-supervised Crowd Counting.
Semi and Weakly-Supervised Crowd Counting.
Vision Transformer.
Counting via Pixel-by-Pixel Probabilistic Distribution Modelling
Pixel-wise Distribution Matching Loss
Transformer specialization
Inter-branch Expectation Consistency Based On Dual-Branch Interleaving Structure
Experiments
Implementation Details
Comparisons to the State of the Arts
Semi-supervised Counting Performance on NWPU
The impact of PDM and ECR loss.
The influence of norm level in PDM loss.
...and 18 more sections

Figures (9)

Figure 1: The structure of P$^3$Net. (a) The dual-branch structure, which uses density tokens to predict interleaving density categories. Different token colors denote different density intervals. The softmax is applied over the categories, and each predicted distribution map is the segmentation map learned by a specific token for its density category. (b) The structure of the decoder. (c) The inter-branch Expectation Consistency Regularization for self-supervised learning. (d) The horizontal axis shows the density values, with the attached squares marking the discrete density intervals that correspond to the tokens. The vertical axis is the normalized distribution value of each category for that pixel.
Figure 2: Similar regions of the same density level exist within an image. We use a density token to specify a density interval and group the regions of that level.
Figure 3: Visualizations of predicted densities on unlabeled training images of ShanghaiTech A. First row: input images. Second row: density maps predicted by the SUA model. Third row: density maps predicted by our P$^3$Net. On unlabeled data, the SUA model produces serious false alarms in the background, as shown in the second row. In contrast, our density-token-guided model behaves more stably and produces more accurate density maps, as shown in the third row.
Figure 4: SDDS uses a shared decoder with two independent sets of tokens for interleaving intervals.
Figure 5: STDS uses two independent decoders with a shared set of tokens for interleaving intervals.
...and 4 more figures
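
The captions above refer to two branches that predict over interleaving density intervals and are tied together by an expectation consistency term. A minimal sketch of that idea follows, under loudly stated assumptions: the interval edges are hypothetical, branch B's edges are simply branch A's shifted by half an interval, and the mean absolute difference is a generic choice of discrepancy rather than the ECR formula from the paper.

```python
import numpy as np

def branch_expectation(probs, edges):
    """Expected density per pixel, using interval midpoints as representative values."""
    centers = (edges[:-1] + edges[1:]) / 2.0
    return probs @ centers

def expectation_consistency(probs_a, edges_a, probs_b, edges_b):
    """Mean absolute gap between the two branches' expected densities.

    Assumption: an illustrative consistency measure that needs no labels.
    """
    exp_a = branch_expectation(probs_a, edges_a)
    exp_b = branch_expectation(probs_b, edges_b)
    return np.abs(exp_a - exp_b).mean()

# Hypothetical interleaved intervals: branch B is offset by half an interval.
edges_a = np.array([0.0, 1.0, 2.0, 3.0])  # midpoints 0.5, 1.5, 2.5
edges_b = np.array([0.5, 1.5, 2.5, 3.5])  # midpoints 1.0, 2.0, 3.0

# One pixel: branch A is certain of its middle interval (expectation 1.5).
probs_a = np.array([[0.0, 1.0, 0.0]])
# Branch B splits its mass across the two intervals that straddle 1.5.
probs_b_agree = np.array([[0.5, 0.5, 0.0]])
probs_b_disagree = np.array([[0.0, 0.0, 1.0]])

agree = expectation_consistency(probs_a, edges_a, probs_b_agree, edges_b)
disagree = expectation_consistency(probs_a, edges_a, probs_b_disagree, edges_b)
```

Because the two branches use different interval boundaries, their class predictions cannot be compared directly, but their expectations live on the same density scale and can be.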
