Zero-shot Object Counting with Good Exemplars

Huilin Zhu, Jingling Yuan, Zhengwei Yang, Yu Guo, Zheng Wang, Xian Zhong, Shengfeng He

TL;DR

This work tackles zero-shot object counting by addressing exemplar quality, a key bottleneck for scaling to unseen classes. It introduces VA-Count, a framework consisting of an Exemplar Enhancement Module (EEM) and a Noise Suppression Module (NSM) that jointly refine exemplar discovery and mitigate misidentifications, leveraging vision-language pretraining models such as Grounding DINO for cross-modal alignment. The method defines density maps $D^p$, $D^n$, and $D^g$ and optimizes a combination of a contrastive loss $\mathcal{L}_C$ and a density regression loss $\mathcal{L}_D$ to improve counting accuracy, while enforcing single-object exemplars via a binary classifier $\delta(\cdot)$. Empirical results on FSC-147 and CARPK show state-of-the-art or competitive performance in zero-shot settings, with ablations confirming the contributions of single-object filtering, exemplar filtering, and contrastive density learning. Overall, VA-Count generalizes and scales well for zero-shot counting across diverse classes, highlighting the potential of vision-language models to bridge textual targets and visual content in counting tasks.
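
As an illustration of how the two losses might be combined, the following minimal sketch treats density maps as nested lists. It rests on several assumptions that the summary does not confirm: $D^p$, $D^n$, and $D^g$ are read as the positive-exemplar, negative-exemplar, and ground-truth maps; $\mathcal{L}_D$ is taken to be a mean squared error; $\mathcal{L}_C$ is replaced by a generic margin-based term; and the weight `lam` and the `margin` are hypothetical parameters. The only standard fact used is that a density map's predicted count is the sum of its values.

```python
def _flatten(density):
    """Flatten a 2-D density map (list of rows) into a flat list."""
    return [value for row in density for value in row]


def count_from_density(density):
    """Predicted object count is the sum of the density map."""
    return float(sum(_flatten(density)))


def mse(a, b):
    """Mean squared error between two maps of the same shape."""
    flat_a, flat_b = _flatten(a), _flatten(b)
    return sum((x - y) ** 2 for x, y in zip(flat_a, flat_b)) / len(flat_a)


def density_regression_loss(d_pos, d_gt):
    """Assumed form of L_D: MSE between the positive map and the ground truth."""
    return mse(d_pos, d_gt)


def contrastive_loss(d_pos, d_neg, d_gt, margin=1.0):
    """Hypothetical stand-in for L_C (not the paper's formula): the positive
    map should be closer to the ground truth than the negative map, by at
    least `margin`."""
    return max(0.0, mse(d_pos, d_gt) - mse(d_neg, d_gt) + margin)


def total_loss(d_pos, d_neg, d_gt, lam=0.5, margin=1.0):
    """Weighted sum L_D + lam * L_C; `lam` is an assumed hyperparameter."""
    return (density_regression_loss(d_pos, d_gt)
            + lam * contrastive_loss(d_pos, d_neg, d_gt, margin))


# Toy 4x4 ground-truth map containing two objects.
d_gt = [[0.0] * 4 for _ in range(4)]
d_gt[0][0] = 1.0
d_gt[2][3] = 1.0
print(count_from_density(d_gt))  # 2.0
```

The margin form is only one common way to express "positive close, negative far"; any contrastive objective with the same ordering property could be substituted without changing the overall structure.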

Abstract

Zero-shot object counting (ZOC) aims to enumerate objects in images using only the names of object classes during testing, without the need for manual annotations. However, a critical challenge in current ZOC methods lies in their inability to identify high-quality exemplars effectively. This deficiency hampers scalability across diverse classes and undermines the development of strong visual associations between the identified classes and image content. To this end, we propose the Visual Association-based Zero-shot Object Counting (VA-Count) framework. VA-Count consists of an Exemplar Enhancement Module (EEM) and a Noise Suppression Module (NSM) that synergistically refine the process of class exemplar identification while minimizing the consequences of incorrect object identification. The EEM utilizes advanced vision-language pretraining models to discover potential exemplars, ensuring the framework's adaptability to various classes. Meanwhile, the NSM employs contrastive learning to differentiate between optimal and suboptimal exemplar pairs, reducing the negative effects of erroneous exemplars. VA-Count demonstrates its effectiveness and scalability in zero-shot contexts with superior performance on two object counting datasets.
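
The exemplar-discovery idea can be pictured as a two-stage filter over candidate boxes. The sketch below is illustrative only: the `Candidate` type, its field names, the threshold of 0.5, and the choice of `k = 3` exemplars are all hypothetical, and the scores are assumed to come from an open-vocabulary detector and a binary single-object classifier rather than from any specific API.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Candidate:
    """A candidate exemplar box (hypothetical structure for illustration)."""
    box: Tuple[int, int, int, int]  # (x1, y1, x2, y2) in pixels
    det_score: float                # assumed detector confidence for the class
    single_prob: float              # assumed probability the box holds one object


def select_exemplars(candidates: List[Candidate],
                     k: int = 3,
                     single_threshold: float = 0.5) -> List[Candidate]:
    """Keep boxes judged to contain a single object, then return the k
    boxes with the highest detector confidence."""
    kept = [c for c in candidates if c.single_prob >= single_threshold]
    kept.sort(key=lambda c: c.det_score, reverse=True)
    return kept[:k]


candidates = [
    Candidate((0, 0, 10, 10), det_score=0.9, single_prob=0.2),   # likely several objects
    Candidate((5, 5, 15, 15), det_score=0.8, single_prob=0.9),
    Candidate((20, 20, 30, 30), det_score=0.6, single_prob=0.7),
    Candidate((40, 40, 50, 50), det_score=0.7, single_prob=0.95),
]
for exemplar in select_exemplars(candidates, k=2):
    print(exemplar.box, exemplar.det_score)
```

Filtering before ranking means a high-confidence box that covers several objects never displaces a clean single-object exemplar, which is the behavior the single-object constraint is meant to guarantee.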

Paper Structure

This paper contains 23 sections, 12 equations, 11 figures, 6 tables, 1 algorithm.

Figures (11)

  • Figure 1: Illustration of class-agnostic object counting methods. (a) Few-shot uses limited annotations for counting. (b) Reference-free quantifies objects without annotations. (c) Zero-shot counts specific classes without annotations, further divided into: (c1) Image-text association, leveraging direct image-text correlations. (c2) Class-related exemplar search, using prototypes to link classes with images. (c3) Our method introduces a detection-driven exemplar discovery to harmonize text with visual representations, distinguishing it from prior methods.
  • Figure 2: Overview of the proposed method. The method has two main components: the Exemplar Enhancement Module (EEM), which improves exemplar quality through patch selection integrated with Grounding DINO [Liu2023DINO], and the Noise Suppression Module (NSM), which distinguishes between positive and negative class samples using density maps. A contrastive loss sharpens the separation of target-class objects from other objects in an image.
  • Figure 3: Illustration of single-object exemplar filtering, with a frozen CLIP-ViT encoder and a trainable FFN that distinguishes single from multiple objects.
  • Figure 4: Heatmaps compared with the few-shot method [liu2022countr] on FSC-147. The predicted density map is overlaid on the original RGB image. (Best viewed zoomed in.)
  • Figure 5: Illustration of the final positive (Pos.) and negative (Neg.) exemplars for images on FSC-147.
  • ...and 6 more figures