Mind the Prompt: A Novel Benchmark for Prompt-based Class-Agnostic Counting

Luca Ciampi, Nicola Messina, Matteo Pierucci, Giuseppe Amato, Marco Avvenuti, Fabrizio Falchi

TL;DR

This work identifies fundamental shortcomings in evaluating prompt-based CAC, arguing that existing benchmarks fail to test whether models truly understand which object to count. It introduces PrACo, a benchmark with a negative-label test and a mosaic test, plus dedicated metrics (NMN, PCCN, CntP, CntR, CntF1) to quantify prompt understanding and distractor robustness. Experiments on FSC-147 show that state-of-the-art CAC methods can achieve strong traditional counting metrics yet perform poorly on PrACo, underscoring the need for new training strategies or architectural designs. By providing a rigorous, prompt-aware evaluation framework, PrACo aims to drive the development of more reliable and trustworthy prompt-based CAC systems with real-world applicability.

Abstract

Recently, object counting has shifted towards class-agnostic counting (CAC), which counts instances of arbitrary object classes never seen during model training. With advancements in robust vision-and-language foundation models, there is a growing interest in prompt-based CAC, where object categories are specified using natural language. However, we identify significant limitations in current benchmarks for evaluating this task, which hinder both accurate assessment and the development of more effective solutions. Specifically, we argue that the current evaluation protocols do not measure the ability of the model to understand which object has to be counted. This is due to two main factors: (i) the shortcomings of CAC datasets, which primarily consist of images containing objects from a single class, and (ii) the limitations of current counting performance evaluators, which are based on traditional class-specific counting and focus solely on counting errors. To fill this gap, we introduce the Prompt-Aware Counting (PrACo) benchmark. It comprises two targeted tests coupled with evaluation metrics specifically designed to quantitatively measure the robustness and trustworthiness of existing prompt-based CAC models. We evaluate state-of-the-art methods and demonstrate that, although some achieve impressive results on standard class-specific counting metrics, they exhibit a significant deficiency in understanding the input prompt, indicating the need for more careful training procedures or revised designs. The code for reproducing our results is available at https://github.com/ciampluca/PrACo.

Paper Structure (30 sections, 14 equations, 9 figures, 3 tables)

Figures (9)

  • Figure 1: Prompt-based counting models (CounTX [AminiNaieni23] in this example) have difficulty interpreting the user-provided text that specifies the object class to be counted. The confusion occurs even between semantically very distinct classes, such as marbles and elephants. In some cases, the count for a class not present in the image is even higher than the count for the ground-truth object category (highlighted in orange).
  • Figure 2: Inference schemas for the negative-label test (left) and the mosaic test (right). The numbers in the boxes show the ideal model outcomes: for the negative-label test, the diagonal of the shown matrix should hold the ground-truth object counts, with zeros elsewhere; for the mosaic test, each mosaic outcome should contain the ground-truth count on the top, which is the same along each row of the shown matrix, and zero on the bottom. The dotted boxes at the bottom show the inference procedure needed to compute each entry of the two matrices.
  • Figure 3: Example of the derivation of TPs and FPs in the mosaic scenario. In the shown case, where the model predicts $c^\text{pos} = 20$ and $c^\text{neg} = 3$, the number of estimated true positives is capped at the ground-truth value (15). The remaining 5 counted elements are considered false positives from the positive image ($\text{FP}^\text{pos} = 5$); these are then merged with the false positives from the negative image ($\text{FP}^\text{neg} = c^\text{neg} = 3$) to obtain a total of 8 FPs.
  • Figure 4: Boxplot showing the distribution of the correct count drifts of the different models. Despite their lower mean values, TFPOC and DAVE show a considerable number of outliers, revealing that they may fail catastrophically under some specific conditions.
  • Figure 5: Output density maps of each model for three different (mosaic, input prompt) pairs. The count reported in the blue box is $c^\text{pos}_{ij}$, while the count reported in the dark orange box is $c^\text{neg}_{ij}$. The models often mistakenly count instances from the negative image in the mosaic, even though most of them accurately estimate the positive instances in the upper part.
  • ...and 4 more figures
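
The TP/FP derivation described in the Figure 3 caption can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the function name `mosaic_tp_fp` and its parameter names are hypothetical, and the precision line assumes the standard definition TP / (TP + FP), which the captions above do not spell out. The example uses a positive-half prediction of 20, a negative-half prediction of 3, and a ground truth of 15, which reproduces the 15 TPs and 8 FPs stated in the caption.

```python
def mosaic_tp_fp(c_pos: float, c_neg: float, gt: float) -> tuple[float, float]:
    """Derive true and false positives for one (mosaic, prompt) pair.

    c_pos: predicted count on the positive (top) image of the mosaic
    c_neg: predicted count on the negative (bottom) image of the mosaic
    gt:    ground-truth count of the prompted class in the positive image
    (All names are hypothetical, chosen for this sketch.)
    """
    # True positives are capped at the ground-truth count.
    tp = min(c_pos, gt)
    # Anything counted beyond the ground truth on the positive image is a false positive.
    fp_pos = max(c_pos - gt, 0.0)
    # Every instance counted on the negative image is a false positive.
    fp_neg = c_neg
    return tp, fp_pos + fp_neg


# Values consistent with the Figure 3 caption: 15 TPs and 5 + 3 = 8 FPs.
tp, fp = mosaic_tp_fp(c_pos=20, c_neg=3, gt=15)

# Assumption: standard precision, TP / (TP + FP).
precision = tp / (tp + fp)
print(tp, fp, round(precision, 3))
```

Capping TP at the ground truth means that overcounting on the positive image is penalized as false positives rather than rewarded, while any count on the negative image contributes only false positives.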