CountEx: Fine-Grained Counting via Exemplars and Exclusion

Yifeng Huang, Gia Khanh Nguyen, Minh Hoai

TL;DR

The proposed CountEx enables users to express both inclusion and exclusion intent, specifying what to count and what to ignore, through multimodal prompts including natural language descriptions and optional visual exemplars.

Abstract

This paper presents CountEx, a discriminative visual counting framework designed to address a key limitation of existing prompt-based methods: the inability to explicitly exclude visually similar distractors. While current approaches allow users to specify what to count via inclusion prompts, they often struggle in cluttered scenes with confusable object categories, leading to ambiguity and overcounting. CountEx enables users to express both inclusion and exclusion intent, specifying what to count and what to ignore, through multimodal prompts including natural language descriptions and optional visual exemplars. At the core of CountEx is a novel Discriminative Query Refinement module, which jointly reasons over inclusion and exclusion cues by first identifying shared visual features, then isolating exclusion-specific patterns, and finally applying selective suppression to refine the counting query. To support systematic evaluation of fine-grained counting methods, we introduce CoCount, a benchmark comprising 1,780 videos and 10,086 annotated frames across 97 category pairs. Experiments show that CountEx achieves substantial improvements over state-of-the-art methods for counting objects from both known and novel categories. The data and code are available at https://github.com/bbvisual/CountEx.

Paper Structure (27 sections, 8 equations, 8 figures, 13 tables)

Figures (8)

  • Figure 1: Given a cluttered scene containing multiple object categories, our method allows users to specify both inclusion and exclusion intent, e.g., "Count penne pasta, not spiral pasta", via language prompts and optional visual exemplars, enabling more precise and controllable counting.
  • Figure 2: Object categories and their variants across five super-categories on CoCount.
  • Figure 3: Overview of CountEx. Given an image and multimodal prompts (positive and negative text with optional visual exemplars), we encode them into query sets $\mathbf{Q}^{\text{pos}}$ and $\mathbf{Q}^{\text{neg}}$. Our Discriminative Query Refinement (DQR) module consists of three stages: (1) Shared Feature Identification learns prototypes $\mathbf{C}$ capturing common features between both query sets; (2) Exclusive Feature Extraction isolates negative-exclusive patterns $\mathbf{R}^{\text{neg}}$ by projecting $\mathbf{Q}^{\text{neg}}$ onto $\mathbf{C}$ and filtering residuals; (3) Selective Query Refinement produces refined queries $\tilde{\mathbf{Q}}^{\text{pos}}$ by selectively suppressing negative patterns via attention. The refined queries are fed to detection heads for final predictions. An auxiliary density prediction branch provides additional supervision during training.
  • Figure 4: Qualitative results on the CoCount test set.
  • Figure 5: Impact of prompt specificity.
  • ...and 3 more figures
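
The three DQR stages described in the Figure 3 caption can be illustrated with a rough NumPy sketch. This is not the authors' implementation: the use of plain scaled dot-product attention, the residual update of the prototypes, the norm-based residual filter, the `threshold` parameter, and the function names `attend` and `refine_queries` are all assumptions made for illustration. In the paper the prototypes $\mathbf{C}$ are learned; here they are simply passed in as an array.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attend(queries, keys, values):
    # Plain scaled dot-product attention (an assumed stand-in for
    # whatever attention variant the paper actually uses).
    scores = queries @ keys.T / np.sqrt(queries.shape[-1])
    return softmax(scores) @ values

def refine_queries(q_pos, q_neg, prototypes, threshold=0.0):
    """Hypothetical sketch of the three DQR stages.

    q_pos:      (n_pos, d) inclusion queries, Q^pos
    q_neg:      (n_neg, d) exclusion queries, Q^neg
    prototypes: (k, d) initial shared prototypes (learned in the paper)
    threshold:  assumed norm cutoff for filtering residuals
    """
    # Stage 1 (Shared Feature Identification): prototypes C gather
    # features that are common to both query sets.
    both = np.concatenate([q_pos, q_neg], axis=0)
    c = prototypes + attend(prototypes, both, both)

    # Stage 2 (Exclusive Feature Extraction): project Q^neg onto C and
    # keep the residual, i.e. what the shared prototypes cannot explain.
    residual = q_neg - attend(q_neg, c, c)
    norms = np.linalg.norm(residual, axis=-1, keepdims=True)
    r_neg = residual * (norms > threshold)  # assumed filtering rule

    # Stage 3 (Selective Query Refinement): suppress negative-exclusive
    # patterns in the positive queries via attention.
    return q_pos - attend(q_pos, r_neg, r_neg)

rng = np.random.default_rng(0)
q_pos = rng.normal(size=(4, 8))
q_neg = rng.normal(size=(3, 8))
protos = rng.normal(size=(2, 8))
q_refined = refine_queries(q_pos, q_neg, protos)
print(q_refined.shape)
```

In this sketch, setting `threshold` high enough to filter out every residual leaves the positive queries unchanged, which mirrors the intent that only exclusion-specific patterns, not features shared with the target category, should be suppressed.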