Mind the Prompt: A Novel Benchmark for Prompt-based Class-Agnostic Counting
Luca Ciampi, Nicola Messina, Matteo Pierucci, Giuseppe Amato, Marco Avvenuti, Fabrizio Falchi
TL;DR
This work identifies fundamental shortcomings in how prompt-based class-agnostic counting (CAC) is evaluated, arguing that existing benchmarks fail to test whether models truly understand which object to count. It introduces PrACo, a benchmark with a negative-label test and a mosaic test, plus dedicated metrics (NMN, PCCN, CntP, CntR, CntF1) to quantify prompt understanding and distractor robustness. Experiments on FSC-147 show that state-of-the-art CAC methods can score well on traditional counting metrics yet perform poorly on PrACo, underscoring the need for new training strategies or architectural designs. By providing a rigorous, prompt-aware evaluation framework, PrACo aims to drive the development of more reliable and trustworthy prompt-based CAC systems with real-world applicability.
Abstract
Recently, object counting has shifted towards class-agnostic counting (CAC), which counts instances of arbitrary object classes never seen during model training. With advancements in robust vision-and-language foundation models, there is growing interest in prompt-based CAC, where object categories are specified using natural language. However, we identify significant limitations in current benchmarks for evaluating this task, which hinder both accurate assessment and the development of more effective solutions. Specifically, we argue that the current evaluation protocols do not measure the ability of the model to understand which object should be counted. This is due to two main factors: (i) the shortcomings of CAC datasets, which primarily consist of images containing objects from a single class, and (ii) the limitations of current counting performance evaluators, which are based on traditional class-specific counting and focus solely on counting errors. To fill this gap, we introduce the Prompt-Aware Counting (PrACo) benchmark. It comprises two targeted tests coupled with evaluation metrics specifically designed to quantitatively measure the robustness and trustworthiness of existing prompt-based CAC models. We evaluate state-of-the-art methods and demonstrate that, although some achieve impressive results on standard class-specific counting metrics, they exhibit a significant deficiency in understanding the input prompt, indicating the need for more careful training procedures or revised designs. The code for reproducing our results is available at https://github.com/ciampluca/PrACo.
