CountZES: Counting via Zero-Shot Exemplar Selection

Muhammad Ibraheem Siddiqui, Muhammad Haris Khan

TL;DR

CountZES tackles zero-shot object counting by selecting exemplars without any task-specific training. It introduces a three-stage exemplar discovery pipeline (DAE, DGE, FCE) that combines text grounding, density-based consistency, and feature-level consensus to produce diverse, high-quality exemplars. The method generalizes across natural, aerial, and medical counting benchmarks and achieves state-of-the-art or competitive performance without fine-tuning. This training-free, modular approach reduces reliance on dataset-specific supervision and offers a scalable way to count unseen categories across varied scenes.

Abstract

Object counting in complex scenes remains challenging, particularly in the zero-shot setting, where the goal is to count instances of unseen categories specified only by a class name. Existing zero-shot object counting (ZOC) methods that infer exemplars from text either rely on open-vocabulary detectors, which often yield multi-instance candidates, or on random patch sampling, which fails to accurately delineate object instances. To address this, we propose CountZES, a training-free framework for object counting via zero-shot exemplar selection. CountZES progressively discovers diverse exemplars through three synergistic stages: Detection-Anchored Exemplar (DAE), Density-Guided Exemplar (DGE), and Feature-Consensus Exemplar (FCE). DAE refines open-vocabulary detections to isolate precise single-instance exemplars. DGE introduces a density-driven, self-supervised paradigm to identify statistically consistent and semantically compact exemplars, while FCE reinforces visual coherence through feature-space clustering. Together, these stages yield a diverse, complementary exemplar set that balances textual grounding, count consistency, and feature representativeness. Experiments on diverse datasets demonstrate CountZES's superior performance among ZOC methods and its effective generalization across natural, aerial, and medical domains.

Paper Structure

This paper contains 11 sections, 7 equations, 13 figures, 6 tables.

Figures (13)

  • Figure 1: Zero-shot object counting comparison of our CountZES against the recently proposed T2ICount [qian2025t2icount] and GeCo [pelhan2024novel] across diverse benchmarks spanning natural, aerial, and medical domains.
  • Figure 2: (a) Detectors like GroundingDINO [liu2024grounding] often yield very few single-instance boxes, or fail to provide any. (b) Counting is sensitive to exemplar choice. (c) Introducing CountZES, a training-free, multi-stage exemplar selection approach for ZOC.
  • Figure 3: Overview of CountZES. The pipeline operates in three stages. DAE refines GroundingDINO detections using CLIP similarity via SSES to obtain text-aligned single-instance regions. DGE then adopts a density-driven paradigm: it generates a density map conditioned on the DAE exemplar, uses P2P prompting to identify candidate exemplars, applies RoI-based single-instance filtering, and selects the most reliable exemplar through GGES. Finally, FCE focuses on visual coherence by projecting candidate exemplars onto SAM’s feature map and using FRES to select the representative exemplar. One exemplar from each stage forms a diverse exemplar set for the final object count.
  • Figure 4: Overview of the GGES module in the DGE stage. Per-box counts from single-instance candidates are used to estimate a pseudo-GT count via non-parametric density estimation. A SAM-based similarity map, conditioned on the DAE exemplar, captures semantic correspondence, and a composite score combining count proximity to the pseudo-GT and similarity-map entropy identifies the final density-guided exemplar.
  • Figure 5: Overview of the FRES module in the FCE stage. Single-instance boxes $\mathcal{B}_{\text{single}}$ are projected onto SAM’s upsampled feature map $\Phi^{\text{up}}$ to obtain $\ell_2$-normalized regional embeddings. These embeddings are clustered, and the box closest to the majority-cluster centroid is selected as the representative exemplar.
  • ...and 8 more figures
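
The selection logic described in the captions of Figures 4 and 5 can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names (pseudo_gt_count, gges_select, fres_select) and the parameter alpha are hypothetical, and several details the captions leave open are filled with assumptions. Specifically, the pseudo-GT count is taken as the mode of a Gaussian kernel density estimate with a Silverman-style bandwidth, each candidate is assumed to come with its own similarity-map patch, lower entropy is assumed to be preferred, the composite score is an equally weighted sum, and the clustering is plain k-means with two clusters.

```python
import numpy as np

def pseudo_gt_count(counts):
    """Assumed estimator: mode of a Gaussian KDE over per-box counts."""
    counts = np.asarray(counts, dtype=float)
    # Silverman-style bandwidth; the floor avoids division by zero.
    h = max(1.06 * counts.std() * len(counts) ** -0.2, 1e-3)
    grid = np.linspace(counts.min(), counts.max(), 512)
    density = np.exp(-0.5 * ((grid[:, None] - counts[None, :]) / h) ** 2).sum(axis=1)
    return float(grid[np.argmax(density)])

def map_entropy(sim_patch):
    """Shannon entropy of a similarity patch normalized to sum to one."""
    p = np.clip(np.asarray(sim_patch, dtype=float), 0, None).ravel()
    p = p / (p.sum() + 1e-12)
    return float(-(p * np.log(p + 1e-12)).sum())

def gges_select(counts, sim_patches, alpha=0.5):
    """Index of the candidate with the lowest composite score (assumed form)."""
    counts = np.asarray(counts, dtype=float)
    target = pseudo_gt_count(counts)
    count_err = np.abs(counts - target) / (target + 1e-12)  # count proximity
    entropy = np.array([map_entropy(p) for p in sim_patches])
    entropy = entropy / (entropy.max() + 1e-12)             # scale to [0, 1]
    return int(np.argmin(alpha * count_err + (1 - alpha) * entropy))

def fres_select(embeddings, k=2, iters=20, seed=0):
    """Index of the box nearest to the majority-cluster centroid."""
    x = np.asarray(embeddings, dtype=float)
    x = x / (np.linalg.norm(x, axis=1, keepdims=True) + 1e-12)  # l2-normalize
    rng = np.random.default_rng(seed)
    centroids = x[rng.choice(len(x), size=min(k, len(x)), replace=False)]
    for _ in range(iters):  # plain k-means (an assumed clustering choice)
        dists = np.linalg.norm(x[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        for j in range(len(centroids)):
            if np.any(labels == j):
                centroids[j] = x[labels == j].mean(axis=0)
    major = np.bincount(labels, minlength=len(centroids)).argmax()
    members = np.flatnonzero(labels == major)
    d = np.linalg.norm(x[members] - centroids[major], axis=1)
    return int(members[d.argmin()])
```

In this sketch, gges_select takes per-box counts plus one similarity patch per candidate, and fres_select takes one embedding row per single-instance box. Both return an index into the candidate list, so the chosen boxes can be looked up directly.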