Regressor-Segmenter Mutual Prompt Learning for Crowd Counting

Mingyue Guo, Li Yuan, Zhaoyi Yan, Binghui Chen, Yaowei Wang, Qixiang Ye

TL;DR

This paper tackles the challenge of annotation variance in crowd counting, where dot-based labels introduce bias and context-inference errors in density maps. It introduces mPrompt, a mutual prompt learning framework that jointly trains a regressor and a head-segmentation branch, using point prompts to generate pseudo masks and context prompts to constrain density predictions within segmentation regions. The approach unifies a two-branch architecture with a shared backbone, leverages offline and online point prompts plus a K-NN context strategy, and extends to foundation models via learnable adapters. Empirical results on multiple public datasets show state-of-the-art or competitive MAE performance, with ablations confirming the contribution of each component and visualizations illustrating improved density map accuracy and segmentation quality. Overall, mPrompt demonstrates a principled method to extract robust spatial context from noisy labels and suggests a general framework for integrating segmentation guidance into dense prediction tasks.
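One simple reading of "constraining density predictions within segmentation regions" is to zero out predicted density wherever the segmenter sees no head. The sketch below illustrates only that idea. The function names `constrain_density` and `total_count`, the 0.5 threshold, and the nested-list representation are assumptions made for illustration, not details taken from the paper, which may apply the constraint differently (for example during training rather than as a post-hoc mask).

```python
from typing import List

Grid = List[List[float]]


def constrain_density(density: Grid, mask_prob: Grid, threshold: float = 0.5) -> Grid:
    """Zero out density values that fall outside the binarized head mask.

    `threshold` is an assumed cut-off for turning mask probabilities into
    a foreground/background decision.
    """
    same_rows = len(density) == len(mask_prob)
    if not same_rows or any(len(d) != len(m) for d, m in zip(density, mask_prob)):
        raise ValueError("density and mask_prob must have the same shape")
    return [
        [d if m >= threshold else 0.0 for d, m in zip(d_row, m_row)]
        for d_row, m_row in zip(density, mask_prob)
    ]


def total_count(density: Grid) -> float:
    """The predicted crowd count is the sum over the density map."""
    return sum(sum(row) for row in density)


# Toy 2x2 example: the bottom row is background according to the mask.
density = [[0.2, 0.5], [0.3, 0.1]]
mask_prob = [[0.9, 0.8], [0.1, 0.2]]
constrained = constrain_density(density, mask_prob)
```

In this toy case the density in the bottom row is discarded, so background responses no longer contribute to the count.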

Abstract

Crowd counting has achieved significant progress by training regressors to predict instance positions. In heavily crowded scenarios, however, regressors are challenged by uncontrollable annotation variance, which biases density maps and makes context information inaccurate. In this study, we propose mutual prompt learning (mPrompt), which leverages a regressor and a segmenter as guidance for each other, addressing the bias and inaccuracy caused by annotation variance while distinguishing foreground from background. Specifically, mPrompt leverages point annotations to tune the segmenter and predict pseudo head masks through point prompt learning. It then uses the predicted segmentation masks as a spatial constraint to rectify biased point annotations through context prompt learning. mPrompt thus defines a form of mutual information maximization via prompt learning, mitigating the impact of annotation variance while improving model accuracy. Experiments show that mPrompt significantly reduces the Mean Absolute Error (MAE), demonstrating its potential as a general framework for downstream vision tasks.
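For reference, MAE in crowd counting is conventionally the mean absolute difference between predicted and ground-truth counts per image, $\mathrm{MAE} = \frac{1}{N}\sum_{i=1}^{N} |\hat{c}_i - c_i|$. The minimal sketch below computes it; the function name and the example counts are hypothetical and are not results from the paper.

```python
from typing import Sequence


def mean_absolute_error(pred_counts: Sequence[float], true_counts: Sequence[float]) -> float:
    """Average of |predicted count - true count| over all images."""
    if len(pred_counts) != len(true_counts) or len(pred_counts) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    total = sum(abs(p - t) for p, t in zip(pred_counts, true_counts))
    return total / len(pred_counts)


# Hypothetical per-image counts, for illustration only.
pred = [102.0, 48.5, 310.0]
true = [100.0, 50.0, 300.0]
mae = mean_absolute_error(pred, true)  # (2 + 1.5 + 10) / 3
```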

Paper Structure (19 sections, 13 equations, 16 figures, 5 tables)

Figures (16)

  • Figure 1: Upper: Biased point annotations impede accurate model learning. mPrompt leverages context prompts and point prompts to mine spatial context and rectify biased annotations for crowd counting. Lower: Illustration of mutual prompt learning (mPrompt), which completes the pseudo segmentation mask using point prompt learning. Meanwhile, it leverages the rectified masks as spatial context information to refine biased point annotations through context prompt learning. (Best viewed in color)
  • Figure 2: mPrompt consists of four components: a shared backbone for feature extraction, a regressor for density map ($\hat{y}$) prediction, a segmenter for head region ($\hat{m}$) estimation, and a mutual prompt learning module.
  • Figure 3: Illustration of the generation of prompt information for the segmenter. White boxes highlight key regions for clarity. The red-shaded areas represent the head segmentation mask, showing that the pseudo mask is inaccurate compared with the more precise updated target mask. With the offline prompt, the prompted segmenter tends to predict more complete head regions but introduces background noise. With the online prompt, background noise is reduced. (Best viewed in color with zoom)
  • Figure 4: Illustration of the $K$-NN algorithm, which removes background noise from the target segmentation mask.
  • Figure 5: mPrompt with learnable prompt modules based on a pre-trained model.
  • ...and 11 more figures
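
The Figure 4 caption says only that a $K$-NN step removes background noise from the target mask; the exact procedure is not given here. The following is a hypothetical sketch of one way such a filter could work, not the authors' algorithm: each annotated point gets a radius proportional to the mean distance to its `k` nearest annotated neighbours, and mask pixels farther than that radius from every point are discarded. The names `knn_radius` and `filter_mask` and the parameters `k`, `beta`, and `max_radius` are all assumptions.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]  # (row, col) of an annotated head


def knn_radius(points: List[Point], idx: int, k: int) -> float:
    """Mean distance from points[idx] to its k nearest other points."""
    dists = sorted(math.dist(points[idx], q) for j, q in enumerate(points) if j != idx)
    nearest = dists[:k]
    return sum(nearest) / len(nearest) if nearest else float("inf")


def filter_mask(mask: List[List[int]], points: List[Point], k: int = 3,
                beta: float = 0.5, max_radius: float = 5.0) -> List[List[int]]:
    """Keep a foreground pixel only if it lies near some annotated point.

    beta scales the K-NN distance into an assumed head radius; max_radius
    caps it so isolated points do not keep large areas.
    """
    radii = [min(beta * knn_radius(points, i, k), max_radius) for i in range(len(points))]
    filtered = []
    for r, row in enumerate(mask):
        out_row = []
        for c, value in enumerate(row):
            near = any(math.dist((r, c), p) <= rad for p, rad in zip(points, radii))
            out_row.append(1 if value and near else 0)
        filtered.append(out_row)
    return filtered


# Toy mask: two head blobs plus one stray foreground pixel at (0, 5).
mask = [
    [1, 1, 0, 0, 0, 1],
    [1, 1, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1],
    [0, 0, 0, 0, 1, 1],
]
points = [(0.0, 0.0), (3.0, 5.0)]
filtered = filter_mask(mask, points, k=1)
```

Here the stray pixel at (0, 5) is more than the assumed radius away from both annotated points and is removed, while both blobs around the annotations survive.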