Object Counting with GPT-4o and GPT-5: A Comparative Study
Richard Füzesséry, Kaziwa Saleh, Sándor Szénási, Zoltán Vámossy
TL;DR
The study investigates zero-shot object counting by leveraging pre-trained multimodal LLMs (GPT-4o and GPT-5) with text prompts, without visual exemplars, and evaluates on FSC-147 and CARPK to gauge open-world counting capabilities. The method uses two prompts and multiple runs to obtain stable counts, revealing that GPT-5 generally yields higher accuracy than GPT-4o on FSC-147 while both struggle more on CARPK and in highly crowded scenes. The results show GPT-5 can reach competitive zero-shot performance on FSC-147, approaching state-of-the-art baselines, though limitations include cost and lack of visual exemplars. The work highlights the potential of using text-only prompts with vision-language models for open-world counting and points to future enhancements via descriptions from auxiliary VLMs.
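The summary mentions repeated runs to obtain stable counts but does not say how the individual replies are combined. The following is a minimal sketch of one plausible approach, assuming each run returns a free-text reply and that the median of the parsed integers is used as the final count. The function names `parse_count` and `aggregate_counts` and the choice of the median are illustrative assumptions, not the authors' documented procedure.

```python
import re
from statistics import median
from typing import List, Optional


def parse_count(reply: str) -> Optional[int]:
    """Return the first non-negative integer found in a model reply, or None."""
    # Accept digit groups with thousands separators, e.g. "1,204".
    match = re.search(r"\d[\d,]*", reply)
    if match is None:
        return None
    return int(match.group().replace(",", ""))


def aggregate_counts(replies: List[str]) -> Optional[int]:
    """Combine several replies into one count (assumption: median of parsed values)."""
    counts = [c for c in (parse_count(r) for r in replies) if c is not None]
    if not counts:
        return None  # no run produced a usable number
    return round(median(counts))


# Hypothetical replies from repeated runs on the same image.
replies = ["There are 12 apples.", "12", "I count 15 apples", "Unable to tell."]
final_count = aggregate_counts(replies)
```

The median is chosen here because a single outlying reply does not shift it, which matters when one run badly over- or under-counts; a mean or a majority vote would be equally reasonable alternatives.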
Abstract
Zero-shot object counting aims to estimate the number of object instances belonging to novel categories that the counting model never encountered during training. Existing methods typically require a large amount of annotated data and often rely on visual exemplars to guide the counting process. However, large language models (LLMs) are powerful tools with remarkable reasoning and data understanding abilities, which suggests the possibility of using them for counting tasks without any supervision. In this work, we leverage the visual capabilities of two multimodal LLMs, GPT-4o and GPT-5, to perform object counting in a zero-shot manner using only textual prompts. We evaluate both models on the FSC-147 and CARPK datasets and provide a comparative analysis. Our findings show that the models achieve performance comparable to state-of-the-art zero-shot approaches on FSC-147 and, in some cases, even surpass them.
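The abstract does not name the metrics behind the comparison. Results on FSC-147 and CARPK are conventionally reported as mean absolute error (MAE) and root mean squared error (RMSE) over per-image counts, so the sketch below assumes those two. The helper names and the sample numbers are illustrative only and are not results from the study.

```python
import math
from typing import Sequence


def mae(predicted: Sequence[float], truth: Sequence[float]) -> float:
    """Mean absolute error between predicted and ground-truth counts."""
    if len(predicted) != len(truth) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(abs(p - t) for p, t in zip(predicted, truth)) / len(truth)


def rmse(predicted: Sequence[float], truth: Sequence[float]) -> float:
    """Root mean squared error; penalises large per-image misses more than MAE."""
    if len(predicted) != len(truth) or not truth:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = sum((p - t) ** 2 for p, t in zip(predicted, truth))
    return math.sqrt(squared / len(truth))


# Made-up counts for three images, for illustration only.
predicted_counts = [10, 20, 30]
true_counts = [12, 20, 26]
print(f"MAE={mae(predicted_counts, true_counts):.2f}, "
      f"RMSE={rmse(predicted_counts, true_counts):.2f}")
```

Because RMSE squares each error, a few badly miscounted crowded images raise it much more than they raise MAE, which is why the two are usually reported together.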
