Exclusivity-Guided Mask Learning for Semi-Supervised Crowd Instance Segmentation and Counting

Jiyang Huang, Hongru Cheng, Wei Lin, Jia Wan, Antoni B. Chan

Abstract

Semi-supervised crowd analysis is a prominent area of research, as unlabeled data are typically abundant and inexpensive to obtain. However, traditional point-based annotations constrain performance because individual regions are inherently ambiguous, and consequently, learning fine-grained structural semantics from sparse annotations remains an unresolved challenge. In this paper, we first propose an Exclusion-Constrained Dual-Prompt SAM (EDP-SAM), based on our Nearest Neighbor Exclusion Circle (NNEC) constraint, to generate mask supervision for current datasets. With the aim of segmenting individuals in dense scenes, we then propose Exclusivity-Guided Mask Learning (XMask), which enforces spatial separation through a discriminative mask objective. Gaussian smoothing and a differentiable center sampling strategy are utilized to improve feature continuity and training stability. Building on XMask, we present a semi-supervised crowd counting framework that uses instance mask priors as pseudo-labels, which contain richer shape information than traditional point cues. Extensive experiments on the ShanghaiTech A, UCF-QNRF, and JHU++ datasets (using 5%, 10%, and 40% labeled data) verify that our end-to-end model achieves state-of-the-art semi-supervised segmentation and counting performance, effectively bridging the gap between counting and instance segmentation within a unified framework.

Paper Structure

This paper contains 22 sections, 17 equations, 14 figures, 4 tables.

Figures (14)

  • Figure 1: Concept of our approach. EDP-SAM is designed to generate data with mask supervision. To acquire segmentation capabilities, we propose a semi-supervised XMask method that learns a discriminative feature representation. As the final step of our framework, we present a mask-constraint strategy that couples semantic and spatial information to enhance counting accuracy.
  • Figure 2: Overview of the proposed framework. Our semi-supervised pipeline consists of three stages: (1) pretraining and mask preparation, (2) crowd segmentation training, and (3) mask-constrained crowd counting optimization. In Stage 1, mask annotations are generated from point labels via the proposed EDP-SAM (yellow region). Stage 2 presents the semi-supervised XMask training process, including discriminative supervision and pseudo-mask generation. The trained segmentation model is then utilized in Stage 3 to produce pseudo-masks for unlabeled data, where newly designed loss functions are introduced to further enhance counting performance via mask constraint.
  • Figure 3: Point-Superpixel Dual-Prompt SAM with NNEC Constraint Method
  • Figure 4: MaskHead Discriminative Loss. First, points are used to obtain the corresponding ROIs for computation. Based on the pseudo or ground-truth mask, each ROI is divided into positive and negative regions. After Gaussian smoothing is applied, the center feature is extracted and pixel-level L2 distances are computed within the feature map. Finally, the pull and push losses are applied separately.
  • Figure 5: Pseudo-Mask Selection. We use a low threshold to obtain pseudo points, thereby generating all possible masks and ensuring the reliability of the background. Then, among the candidates, we select the mask with the highest probability as the final valid mask.
  • ...and 9 more figures
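
The pull and push losses described in the Figure 4 caption can be illustrated with a small sketch. This is not the authors' implementation: the hinge form, the two margins (`pull_margin`, `push_margin`), the averaging, and the function name `pull_push_loss` are all assumptions borrowed from generic discriminative embedding losses, and Gaussian smoothing is assumed to have been applied to the features already. It only shows the idea that pixels inside the mask are pulled toward the center feature while pixels outside it are pushed away.

```python
import math


def l2(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def pull_push_loss(features, mask, center_rc, pull_margin=0.5, push_margin=1.5):
    """Hypothetical pull/push loss for a single ROI.

    features  : H x W nested list of C-dim vectors (assumed already smoothed)
    mask      : H x W nested list of 0/1 (pseudo or ground-truth mask)
    center_rc : (row, col) of the pixel whose feature acts as the center
    The margins are illustrative assumptions, not values from the paper.
    """
    cr, cc = center_rc
    center = features[cr][cc]
    pull_terms, push_terms = [], []
    for r, row in enumerate(features):
        for c, vec in enumerate(row):
            d = l2(vec, center)
            if mask[r][c]:
                # Positive region: penalize distances larger than pull_margin.
                pull_terms.append(max(0.0, d - pull_margin) ** 2)
            else:
                # Negative region: penalize distances smaller than push_margin.
                push_terms.append(max(0.0, push_margin - d) ** 2)
    pull = sum(pull_terms) / len(pull_terms) if pull_terms else 0.0
    push = sum(push_terms) / len(push_terms) if push_terms else 0.0
    return pull, push


# Toy 3 x 3 ROI with a plus-shaped positive region around the center pixel.
mask = [[0, 1, 0],
        [1, 1, 1],
        [0, 1, 0]]

# Well-separated features: positives at (1, 0), negatives at (-1, 0).
separated = [[[1.0, 0.0] if m else [-1.0, 0.0] for m in row] for row in mask]
print(pull_push_loss(separated, mask, (1, 1)))  # -> (0.0, 0.0)

# Collapsed features: every pixel identical, so negatives are not pushed away.
collapsed = [[[1.0, 0.0] for _ in row] for row in mask]
print(pull_push_loss(collapsed, mask, (1, 1)))  # -> (0.0, 2.25)
```

In the collapsed case the pull term is zero because every positive pixel already coincides with the center, while the push term is large because the negative pixels sit at distance zero from it. This is the failure mode a discriminative mask objective is meant to penalize.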