Object Counting with GPT-4o and GPT-5: A Comparative Study

Richard Füzesséry, Kaziwa Saleh, Sándor Szénási, Zoltán Vámossy

TL;DR

The study investigates zero-shot object counting by leveraging pre-trained multimodal LLMs (GPT-4o and GPT-5) with text prompts, without visual exemplars, and evaluates on FSC-147 and CARPK to gauge open-world counting capabilities. The method uses two prompts and multiple runs to obtain stable counts, revealing that GPT-5 generally yields higher accuracy than GPT-4o on FSC-147 while both struggle more on CARPK and in highly crowded scenes. The results show GPT-5 can reach competitive zero-shot performance on FSC-147, approaching state-of-the-art baselines, though limitations include cost and lack of visual exemplars. The work highlights the potential of using text-only prompts with vision-language models for open-world counting and points to future enhancements via descriptions from auxiliary VLMs.

Abstract

Zero-shot object counting aims to estimate the number of object instances belonging to novel categories that the counting model never encountered during training. Existing methods typically require large amounts of annotated data and often rely on visual exemplars to guide the counting process. However, large language models (LLMs) are powerful tools with remarkable reasoning and data understanding abilities, which suggests the possibility of using them for counting tasks without any supervision. In this work, we leverage the visual capabilities of two multimodal LLMs, GPT-4o and GPT-5, to perform object counting in a zero-shot manner using only textual prompts. We evaluate both models on the FSC-147 and CARPK datasets and provide a comparative analysis. Our findings show that the models achieve performance comparable to state-of-the-art zero-shot approaches on FSC-147 and, in some cases, even surpass them.
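The summary above mentions running each prompt several times to obtain stable counts and comparing the models on benchmark datasets. The following is a minimal sketch of how such an evaluation could be scored. It rests on two assumptions that are not confirmed by the text here: repeated runs are combined with the median, and accuracy is reported as mean absolute error (MAE) and root mean squared error (RMSE), the metrics commonly used in the counting literature. The function names and the numbers are hypothetical and purely illustrative.

```python
from math import sqrt
from statistics import median


def aggregate_runs(counts):
    """Combine repeated predictions for one image into a single count.

    Using the median is an assumption; the paper may aggregate differently.
    """
    if not counts:
        raise ValueError("need at least one prediction per image")
    return median(counts)


def mae(predictions, ground_truth):
    """Mean absolute error between predicted and true counts."""
    return sum(abs(p - g) for p, g in zip(predictions, ground_truth)) / len(ground_truth)


def rmse(predictions, ground_truth):
    """Root mean squared error between predicted and true counts."""
    squared = sum((p - g) ** 2 for p, g in zip(predictions, ground_truth))
    return sqrt(squared / len(ground_truth))


# Made-up counts: three runs for each of two images, plus the true counts.
runs_per_image = [[12, 14, 13], [40, 35, 38]]
ground_truth = [13, 40]

predictions = [aggregate_runs(runs) for runs in runs_per_image]
print(predictions, mae(predictions, ground_truth), rmse(predictions, ground_truth))
```

The median is less sensitive than the mean to a single outlier run, which is one reason it might be preferred when a model occasionally returns a wildly different count.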

Paper Structure

This paper contains 9 sections, 1 equation, 3 figures, 2 tables.

Figures (3)

  • Figure 1: Examples of results produced by GPT-4o and GPT-5 using the second prompt. The predicted counts from each model are shown as P$_{GPT-4o}$ and P$_{GPT-5}$, respectively. In most cases, GPT-5 generates more accurate predictions; however, GPT-4o sometimes performs better.
  • Figure 2: Example of prompts provided to the pre-trained models for object counting. The object classes were obtained directly from the dataset.
  • Figure 3: Examples of images where the pre-trained GPT-4o and GPT-5 models do not produce correct counting predictions. P$_{GPT-4o}$ and P$_{GPT-5}$ denote the predictions of the respective models.