CountGD: Multi-Modal Open-World Counting

Niki Amini-Naieni, Tengda Han, Andrew Zisserman

TL;DR

CountGD tackles open-world counting by enabling counts with text, visual exemplars, or both, extending the GroundingDINO foundation with exemplar embeddings and a counting head in a single-stage architecture. It fuses multi-modal prompts through a sequence of image, exemplar, and text encoders, a feature enhancer, and a cross-modality decoder to produce a count via a learned similarity matrix, with losses that emphasize localization and classification and Hungarian matching. Empirical results on FSC-147, CARPK, and CountBench show state-of-the-art performance when using both modalities, while text-only performance remains competitive with the best open-world text-based methods. The paper also explores interactions between text and exemplars, demonstrating that language can refine exemplar-based cues and that modality fusion yields interpretable improvements in counting accuracy and flexibility. Overall, CountGD significantly broadens open-world counting capabilities and demonstrates strong generalization across datasets in a multi-modal setting.

Abstract

The goal of this paper is to improve the generality and accuracy of open-vocabulary object counting in images. To improve the generality, we repurpose an open-vocabulary detection foundation model (GroundingDINO) for the counting task, and also extend its capabilities by introducing modules to enable specifying the target object to count by visual exemplars. In turn, these new capabilities - being able to specify the target object by multi-modalities (text and exemplars) - lead to an improvement in counting accuracy. We make three contributions: First, we introduce the first open-world counting model, CountGD, where the prompt can be specified by a text description or visual exemplars or both; Second, we show that the performance of the model significantly improves the state of the art on multiple counting benchmarks - when using text only, CountGD is comparable to or outperforms all previous text-only works, and when using both text and visual exemplars, we outperform all previous models; Third, we carry out a preliminary study into different interactions between the text and visual exemplar prompts, including the cases where they reinforce each other and where one restricts the other. The code and an app to test the model are available at https://www.robots.ox.ac.uk/~vgg/research/countgd/.

Paper Structure (48 sections, 4 equations, 8 figures, 4 tables)

This paper contains 48 sections, 4 equations, 8 figures, 4 tables.

Figures (8)

  • Figure 1: CountGD is capable of taking both visual exemplars and text prompts to produce highly accurate object counts (a), but also seamlessly supports counting with only text queries or only visual exemplars (b). The multi-modal visual exemplar and text queries bring extra flexibility to the open-world counting task, such as using a short phrase (c), or adding additional constraints (the words 'left' or 'right') to select a sub-set of the objects (d). These examples are taken from the FSC-147 (Ranjan et al., CVPR 2021) and CountBench (Paiss et al., 2023) test sets. The visual exemplars are shown as yellow boxes. (d) visualizes the predicted confidence map of the model, where a high color intensity indicates a high level of confidence.
  • Figure 2: The CountGD architecture. At inference, the object to be counted can be specified by visual exemplars, text prompts, or both. The input image is passed through the image encoder, $f_{\boldsymbol{\theta_\text{SwinT}}}$, to obtain spatial feature maps at different scales. The visual exemplar tokens are cropped out of this feature map using RoIAlign (as shown in Figure 3). The text is passed through the text encoder, $f_{\boldsymbol{\theta_\text{TT}}}$, to obtain text tokens. In the feature enhancer, $f_{\boldsymbol{\varphi}}$, the visual exemplar tokens and text tokens are fused together with self-attention and cross-attend to the image features, producing the fused visual exemplar and text features, $\mathbf{z_{v, t}}$, and new image features, $\mathbf{z_{I}}$. The $k$ image features $\mathbf{z_{I}}$ that have the highest cosine similarity with the fused features $\mathbf{z_{v, t}}$ are passed to the cross-modality decoder, $f_{\boldsymbol{\psi}}$, as "cross-modality queries". Finally, the similarity matrix, $\mathbf{\hat{Y}}$, between the outputs of the cross-modality decoder, $f_{\boldsymbol{\psi}}$, and $\mathbf{z_{v, t}}$ is calculated, and outputs whose maximum similarity with $\mathbf{z_{v, t}}$ exceeds a confidence threshold $\sigma$ are identified as final detections and enumerated to estimate the final count. Our model is built on top of the GroundingDINO (Liu et al., 2023) architecture, with the additional modules indicated by blue shading.
  • Figure 3: The visual feature extraction pipeline for images and visual exemplars. (a) For the input image, a standard Swin Transformer model is used to extract visual feature maps at multiple spatial resolutions. (b) For the visual exemplars with their corresponding bounding boxes, we first up-scale the multiple visual feature maps of the input image to the same resolution, then concatenate these feature maps, and project them to 256 channels with a $1\times 1$ convolution. Finally, we apply a RoIAlign with the bounding box coordinates to get the visual features for the exemplars.
  • Figure 4: Qualitative counting results on FSC-147 (Ranjan et al., CVPR 2021) and CountBench (Paiss et al., 2023) using the multi-modal CountGD. The model is trained and tested on FSC-147 visual exemplars and text. Input text is written above each image, and visual exemplars are indicated by the red boxes. On CountBench, we test the same model trained on FSC-147 in a zero-shot way with only text (there are no visual exemplars for CountBench). Blue words indicate the subject of each caption input to the model. In both cases, CountGD predicts the count in all images shown with 100% accuracy. Note that on the CountBench examples, the model counts the specified objects correctly when there are multiple types of objects in the image, such as the tomatoes with cucumbers, and the girls with bubbles. Detected points are filtered with a Gaussian and plotted under the input images for visualization purposes.
  • Figure 5: Studying visual exemplar and text interactions. We plot the confidence scores of the instances for each image. In (a) and (b) we show we can specify shape with the exemplar and modify color with text. In (c) we show we can specify spatial location with text, and shape with the exemplar.
  • ...and 3 more figures
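
The last two steps described in the Figure 2 caption (selecting the $k$ image features most similar to the fused prompt features, then counting decoder outputs whose maximum similarity exceeds $\sigma$) can be illustrated with a small sketch. This is not the authors' implementation: the function names `select_queries` and `count_from_similarity` are hypothetical, the cross-modality decoder is skipped entirely (the selected queries are scored directly), and turning a dot product into a confidence with a sigmoid is an assumption made here for illustration.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def select_queries(image_feats, prompt_feats, k):
    """Keep the k image features with the highest cosine similarity
    to any fused prompt token (a stand-in for z_I and z_{v,t})."""
    scores = [max(cosine(f, p) for p in prompt_feats) for f in image_feats]
    order = sorted(range(len(image_feats)), key=lambda i: scores[i], reverse=True)
    return [image_feats[i] for i in order[:k]]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def count_from_similarity(outputs, prompt_feats, sigma):
    """Build one similarity row per output, then count the outputs whose
    maximum similarity to the prompt tokens exceeds the threshold sigma.
    Assumption: similarity = sigmoid(dot product)."""
    count = 0
    for q in outputs:
        row = [sigmoid(sum(x * y for x, y in zip(q, p))) for p in prompt_feats]
        if max(row) > sigma:
            count += 1
    return count


# Toy data: one prompt token and four image features (all values made up).
prompt_feats = [[1.0, 0.0]]
image_feats = [[2.0, 0.1], [0.0, 1.0], [3.0, -0.2], [-1.0, 0.5]]

queries = select_queries(image_feats, prompt_feats, k=3)
# Decoder omitted in this sketch: the queries are scored as they are.
predicted_count = count_from_similarity(queries, prompt_feats, sigma=0.6)
print(predicted_count)
```

In this toy setup, the feature pointing away from the prompt token is dropped at the top-$k$ stage, and of the three remaining queries only the two aligned with the prompt clear the threshold. Raising or lowering `sigma` trades missed instances against false detections.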