Taste More, Taste Better: Diverse Data and Strong Model Boost Semi-Supervised Crowd Counting

Maochen Yang, Zekun Li, Jian Zhang, Lei Qi, Yinghuan Shi

TL;DR

This work targets the semi-supervised crowd counting problem under limited labeled data and challenging scenes. It introduces Taste More Taste Better (TMTB), a framework that combines diffusion-based Inpainting Augmentation with a Visual State Space Model (VSSM) backbone and an Anti-Noise classification head within a Mean Teacher paradigm to produce robust pseudo-supervision. Key innovations include a foreground-preserving inpainting strategy guided by count-interval predictions, an EMA-based inconsistency filter for unreliable augmentations, and a dual-headed architecture that learns both exact density maps and interval-based counts. Across four benchmark datasets and multiple labeling regimes, TMTB achieves state-of-the-art MAEs, demonstrating strong label-efficiency and cross-dataset generalization, with notable improvements such as a 12.4% MAE reduction on JHU-Crowd++ at 5% labels.
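The Mean Teacher paradigm mentioned above keeps a teacher model whose weights are an exponential moving average (EMA) of the student's weights. The following is a minimal sketch of that update rule, not the authors' implementation: the `ema_update` name, the dictionary representation of weights, and the decay value are all assumptions made for illustration.

```python
def ema_update(teacher, student, decay=0.99):
    """Return new teacher weights as an EMA of the student weights.

    `teacher` and `student` map parameter names to floats (a stand-in for
    real tensors). The `decay` value is a hypothetical choice.
    """
    return {
        name: decay * teacher[name] + (1.0 - decay) * student[name]
        for name in teacher
    }


# Toy weights: the teacher moves 10% of the way toward the student.
teacher = {"w": 1.0, "b": 0.0}
student = {"w": 2.0, "b": 1.0}
teacher = ema_update(teacher, student, decay=0.9)
print(teacher)
```

A decay close to 1 makes the teacher change slowly, which is what lets it provide smoother pseudo-labels than the student it tracks.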

Abstract

Semi-supervised crowd counting is crucial for addressing the high annotation costs of densely populated scenes. Although several methods based on pseudo-labeling have been proposed, it remains challenging to utilize unlabeled data effectively and accurately. In this paper, we propose a novel framework called Taste More Taste Better (TMTB), which emphasizes both data and model aspects. First, we explore a data augmentation technique well-suited to the crowd counting task. By inpainting the background regions, this technique effectively enhances data diversity while preserving the fidelity of the entire scene. Second, we introduce the Visual State Space Model as the backbone to capture global context information from crowd scenes, which is crucial for extremely crowded, low-light, and adverse weather scenarios. In addition to the traditional regression head for exact prediction, we employ an Anti-Noise classification head to provide less exact but more accurate supervision, since the regression head is sensitive to noise in manual annotations. We conduct extensive experiments on four benchmark datasets and show that our method outperforms state-of-the-art methods by a large margin. Code is publicly available at https://github.com/syhien/taste_more_taste_better.
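To illustrate why a classification target can be "less exact but more accurate" than a regression target, the sketch below maps a count to a count interval. It is only an illustration under stated assumptions: the interval boundaries in `EDGES` and the `count_to_interval` helper are hypothetical, and the paper's actual intervals and granularity are not given in this summary.

```python
from bisect import bisect_right

# Hypothetical interval boundaries: class 0 is [0, 0.5), class 1 is
# [0.5, 1.5), class 2 is [1.5, 3.5), class 3 is [3.5, 7.5), class 4 is 7.5+.
EDGES = [0.5, 1.5, 3.5, 7.5]


def count_to_interval(count, edges=EDGES):
    """Return the index of the interval that contains `count`."""
    return bisect_right(edges, count)


clean_count = 2.0   # count from a clean annotation
noisy_count = 2.4   # same region with annotation noise

# The regression target shifts with the noise, the interval label does not.
print(abs(noisy_count - clean_count))
print(count_to_interval(clean_count), count_to_interval(noisy_count))
```

Noise that stays inside one interval leaves the classification label unchanged, whereas the regression target absorbs the full error. Noise that crosses a boundary still changes the label, so the choice of interval widths matters.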

Paper Structure

This paper contains 10 sections, 8 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Comparison of augmentations. As shown in the red bounding boxes, both types of Mixup produce poor density maps. Mixup is supposed to retain crowd information from two images, but in practice it destroys the spatial structure. As shown in the green bounding boxes, our inpainting augmentation is better suited to the crowd counting task and yields a clear reduction in Mean Absolute Error (MAE).
  • Figure 2: The overall framework of our method TMTB. TMTB contains the Mean Teacher framework for semi-supervised learning, VSSM as the backbone, a classification branch for predicting masks, and an inpainter for inpainting augmentation. The inpainting process is conducted periodically, which generates or updates the inpainted images. For filtering out unreliable regions, the teacher model generates the weighted mask based on the inconsistency level of the inpainted image, which is applied in $\mathcal{L}^{\text{inp}}$. The teacher model generates pseudo-labels for unlabeled data (inpainted images included), and the student model predicts on labeled data and strongly augmented unlabeled data (inpainted images included).
  • Figure 3: Inpainting samples. The left part shows well-inpainted pairs, while the right part shows poorly inpainted pairs. Columns (a) and (c) are the original images, while columns (b) and (d) are the inpainted images.
  • Figure 4: Visualizations of the predictions on the test set of the JHU-Crowd++ dataset. The first row shows the input images. The second row displays the predictions of the SOTA method MRC-Crowd [mrc-crowd2024]. The third row presents the predictions of our method TMTB. All models are trained with 5% labeled data.
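The Figure 2 caption describes a weighted mask, derived from the inconsistency level of an inpainted image, that is applied in $\mathcal{L}^{\text{inp}}$. The sketch below shows one plausible form of such a mask and is not the paper's formulation: the linear weighting, the `tolerance` parameter, the L1 loss, the use of the teacher's prediction on the original image as the pseudo-label, and both function names are assumptions.

```python
def inconsistency_weights(teacher_orig, teacher_inp, tolerance=1.0):
    """Per-region weights in [0, 1]; larger teacher disagreement gives a
    smaller weight (assumed linear falloff, zero beyond `tolerance`)."""
    weights = []
    for orig, inp in zip(teacher_orig, teacher_inp):
        gap = abs(orig - inp)
        weights.append(max(0.0, 1.0 - gap / tolerance))
    return weights


def weighted_l1(student_pred, pseudo_label, weights):
    """Weighted mean absolute error; regions with weight 0 are ignored."""
    total = sum(w * abs(s - p)
                for s, p, w in zip(student_pred, pseudo_label, weights))
    norm = sum(weights)
    return total / norm if norm > 0 else 0.0


# Toy per-region counts predicted by the teacher on an image pair.
teacher_orig = [2.0, 0.0, 5.0]   # original image
teacher_inp = [2.0, 0.5, 7.0]    # inpainted image
weights = inconsistency_weights(teacher_orig, teacher_inp)

# Student prediction on the inpainted image, supervised by the pseudo-label.
student_pred = [2.5, 0.0, 9.0]
loss = weighted_l1(student_pred, teacher_orig, weights)
print(weights, loss)
```

In this toy case the third region, where the teacher's two predictions differ by 2, receives weight 0 and contributes nothing to the loss, which mirrors the caption's goal of filtering out unreliable regions.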