CountZES: Counting via Zero-Shot Exemplar Selection
Muhammad Ibraheem Siddiqui, Muhammad Haris Khan
TL;DR
CountZES tackles zero-shot object counting by selecting exemplars without task-specific training. It introduces a three-stage exemplar discovery pipeline (DAE, DGE, FCE) that combines text grounding, density-based consistency, and feature-level consensus to produce diverse, high-quality exemplars. The method generalizes across natural, aerial, and medical counting benchmarks and achieves state-of-the-art or competitive performance without fine-tuning. This training-free, modular approach reduces reliance on dataset-specific supervision and offers a scalable way to count unseen categories across varied scenes.
Abstract
Object counting in complex scenes remains challenging, particularly in the zero-shot setting, where the goal is to count instances of unseen categories specified only by a class name. Existing zero-shot object counting (ZOC) methods that infer exemplars from text either rely on open-vocabulary detectors, which often yield multi-instance candidates, or on random patch sampling, which fails to accurately delineate object instances. To address this, we propose CountZES, a training-free framework for object counting via zero-shot exemplar selection. CountZES progressively discovers diverse exemplars through three synergistic stages: Detection-Anchored Exemplar (DAE), Density-Guided Exemplar (DGE), and Feature-Consensus Exemplar (FCE). DAE refines open-vocabulary detections to isolate precise single-instance exemplars. DGE introduces a density-driven, self-supervised paradigm to identify statistically consistent and semantically compact exemplars, while FCE reinforces visual coherence through feature-space clustering. Together, these stages yield a diverse, complementary exemplar set that balances textual grounding, count consistency, and feature representativeness. Experiments on diverse datasets demonstrate CountZES's superior performance among ZOC methods and its effective generalization across natural, aerial, and medical domains.
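To make the idea of feature-level consensus concrete, the following is a minimal toy sketch, not the authors' implementation. It assumes each candidate exemplar is already represented by a feature vector and picks the candidate with the highest mean cosine similarity to the others (a medoid-style selection used here as a stand-in for the feature-space clustering mentioned in the abstract). The function names `cosine` and `consensus_exemplar` and the example vectors are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def consensus_exemplar(features):
    """Return the index of the candidate most similar, on average, to all others.

    Hypothetical helper: an assumed simplification of feature-space consensus,
    not the procedure described in the paper.
    """
    if len(features) < 2:
        raise ValueError("need at least two candidates to measure consensus")
    best_index, best_score = -1, -math.inf
    for i, f in enumerate(features):
        # Mean similarity to every other candidate (self-similarity excluded).
        others = [cosine(f, g) for j, g in enumerate(features) if j != i]
        score = sum(others) / len(others)
        if score > best_score:
            best_index, best_score = i, score
    return best_index


# Illustrative features: three mutually similar candidates and one outlier.
candidates = [
    [1.0, 0.0],
    [0.9, 0.1],
    [0.8, 0.2],
    [0.0, 1.0],  # outlier, e.g. a patch that does not show the target class
]
best = consensus_exemplar(candidates)
```

In this toy input the outlier at index 3 has low similarity to every other candidate, so it is not selected. A real pipeline would obtain the feature vectors from a visual encoder, which is outside the scope of this sketch.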
