Crowd-SAM: SAM as a Smart Annotator for Object Detection in Crowded Scenes

Zhi Cai; Yingjie Gao; Yaoyan Zheng; Nan Zhou; Di Huang

TL;DR

This work tackles object detection in crowded, occluded scenes with limited labeled data. It introduces Crowd-SAM, a SAM-based smart annotator that combines a dense prompt strategy guided by a DINOv2 semantic heatmap, an Efficient Prompt Sampler (EPS), and a Part-Whole Discrimination Network (PWD-Net) to select high-quality masks. The method is trained with a composite loss and filters candidates with a joint mask score $S = S_{iou} \cdot S_{cls}$, enabling one-class few-shot detection and an extension to multi-class detection. Empirically, Crowd-SAM achieves 78.4% AP on CrowdHuman and is competitive with fully supervised detectors and strong few-shot baselines on multiple datasets, which indicates substantial data efficiency for crowded-scene annotation. The approach highlights the potential of pairing large vision foundation models with lightweight discriminators to reduce annotation cost while maintaining high accuracy.
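
As a rough illustration of the joint mask score $S = S_{iou} \cdot S_{cls}$ mentioned above, the following minimal Python sketch multiplies the two scores per candidate and keeps the candidates above a cutoff. The `Candidate` class, the `filter_candidates` helper, and the threshold of 0.5 are hypothetical names and values chosen for this example; they are not taken from the paper or its code release.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    mask_id: int
    s_iou: float  # predicted mask-quality (IoU) score, assumed to lie in [0, 1]
    s_cls: float  # predicted class score, assumed to lie in [0, 1]


def joint_score(c: Candidate) -> float:
    # S = S_iou * S_cls: a mask needs both scores to be high to rank well.
    return c.s_iou * c.s_cls


def filter_candidates(cands, threshold=0.5):
    # The threshold is an illustrative assumption, not a value from the paper.
    kept = [c for c in cands if joint_score(c) >= threshold]
    return sorted(kept, key=joint_score, reverse=True)


candidates = [
    Candidate(0, s_iou=0.90, s_cls=0.80),  # well-shaped mask of the target class
    Candidate(1, s_iou=0.95, s_cls=0.30),  # sharp mask, but unlikely to be the target
    Candidate(2, s_iou=0.60, s_cls=0.90),  # likely target, mediocre mask quality
]
kept_ids = [c.mask_id for c in filter_candidates(candidates)]
print(kept_ids)
```

Because the score is a product, a candidate with a very sharp mask but a low class score (candidate 1 here) is rejected, while candidates that are reasonable on both axes survive.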

Abstract

In computer vision, object detection is an important task that finds application in many scenarios. However, obtaining extensive labels can be challenging, especially in crowded scenes. Recently, the Segment Anything Model (SAM) has been proposed as a powerful zero-shot segmenter, offering a novel approach to instance segmentation tasks. However, the accuracy and efficiency of SAM and its variants are often compromised when handling objects in crowded and occluded scenes. In this paper, we introduce Crowd-SAM, a SAM-based framework designed to enhance SAM's performance in crowded and occluded scenes at the cost of only a few learnable parameters and minimal labeled images. We introduce an efficient prompt sampler (EPS) and a part-whole discrimination network (PWD-Net), enhancing mask selection and accuracy in crowded scenes. Despite its simplicity, Crowd-SAM rivals state-of-the-art (SOTA) fully-supervised object detection methods on several benchmarks including CrowdHuman and CityPersons. Our code is available at https://github.com/FelixCaae/CrowdSAM.

Paper Structure (13 sections, 6 equations, 4 figures, 8 tables, 1 algorithm)

Introduction
Related Work
Method
Preliminaries
Problem Definition and Overall Framework
Class-specific Prompt Generation
Semantic-guided Mask Prediction
Training and Inference
Experiments
Experimental Results on Pedestrian Detection
Experimental Results on Multi-class Object Detection
Ablation Studies
Conclusion

Figures (4)

Figure 1: Pipeline comparison between SAM and Crowd-SAM. Crowd-SAM only requires a few labeled images and can automatically recognize target objects.
Figure 2: The pipeline of Crowd-SAM, showing the interaction between the modules. The DINO encoder and SAM are frozen during training. * marks parameters that are shared. For simplicity, the projection adapter of DINO is omitted.
Figure 3: Illustration of EPS. PWD-Net produces valid masks with a threshold. In each iteration, we prune the prompts (marked with a cross) that fall inside valid masks.
Figure 4: Qualitative comparison between Crowd-SAM (a) and De-FRCN (b). Crowd-SAM predictions are more accurate, especially at the boundaries of persons. We also plot the GT boxes (blue rectangles) and the generated masks (yellow regions), which are of high quality (c). In (d), we plot our prompt filtering results, where the preserved prompts (red points) are far fewer than the removed ones (gray points). Zoom in for a better view.
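
The pruning loop described in the Figure 3 caption can be sketched as follows. This is a toy illustration under stated assumptions, not the authors' implementation: `efficient_prompt_sampling` and `toy_predict` are hypothetical names, `toy_predict` stands in for the SAM mask decoder plus PWD-Net scoring, masks are represented as sets of pixel coordinates, and the batch size and score threshold are arbitrary. Two prompts in the same batch can return the same mask; this sketch does not de-duplicate them.

```python
import random


def efficient_prompt_sampling(prompts, predict, batch_size=2, score_thr=0.5, seed=0):
    """Decode prompts in batches; drop pending prompts covered by a valid mask.

    prompts: list of (x, y) points.
    predict: callable mapping a point to (mask, score); mask is a frozenset of pixels.
    """
    rng = random.Random(seed)
    remaining = list(prompts)
    rng.shuffle(remaining)
    valid_masks = []
    while remaining:
        batch, remaining = remaining[:batch_size], remaining[batch_size:]
        for point in batch:
            mask, score = predict(point)
            if score >= score_thr:  # a mask is "valid" if its score passes the threshold
                valid_masks.append(mask)
        # Prune every pending prompt that falls inside an already valid mask.
        remaining = [p for p in remaining if not any(p in m for m in valid_masks)]
    return valid_masks


# Toy scene: two rectangular "objects" on a 10 x 4 grid, with a background gap.
OBJ_A = frozenset((x, y) for x in range(0, 4) for y in range(4))
OBJ_B = frozenset((x, y) for x in range(6, 10) for y in range(4))
calls = []  # records every decoder call so the savings can be inspected


def toy_predict(point):
    # Hypothetical stand-in for the mask decoder and the mask scorer.
    calls.append(point)
    for obj in (OBJ_A, OBJ_B):
        if point in obj:
            return obj, 0.9
    return frozenset(), 0.1  # background prompt: empty, low-scoring mask


grid = [(x, y) for x in range(10) for y in range(4)]
masks = efficient_prompt_sampling(grid, toy_predict)
print(len(grid), "prompts,", len(calls), "decoder calls,", len(set(masks)), "distinct masks")
```

In this toy setting both objects are recovered while most grid prompts are never decoded, which mirrors the observation in Figure 4 (d) that far fewer prompts are preserved than removed.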
