LVLM-COUNT: Enhancing the Counting Ability of Large Vision-Language Models

Muhammad Fetrat Qharabagh, Mohammadreza Ghofrani, Kimon Fountoulakis

TL;DR

This work proposes a simple yet effective baseline method that enhances LVLMs' counting ability for large numbers of objects using a divide-and-conquer approach, and incorporates a mechanism to prevent objects from being split during division, which could otherwise lead to repetitive counting.

Abstract

Counting is a fundamental operation for various real-world visual tasks, requiring both object recognition and robust counting capabilities. Despite their advanced visual perception, large vision-language models (LVLMs) are known to struggle with counting tasks. In this work, we evaluate the performance of several LVLMs on visual counting tasks across multiple counting and vision datasets. We observe that while their performance may be less prone to error for small numbers of objects, they exhibit significant weaknesses as the number of objects increases. To alleviate this issue, we propose a simple yet effective baseline method that enhances LVLMs' counting ability for large numbers of objects using a divide-and-conquer approach. Our method decomposes counting problems into sub-tasks. Moreover, it incorporates a mechanism to prevent objects from being split during division, which could otherwise lead to repetitive counting -- a common issue in a naive divide-and-conquer implementation. We demonstrate the effectiveness of this approach across various datasets and benchmarks, establishing it as a valuable reference for evaluating future solutions.
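To make the repeated-counting issue concrete, the following is a minimal Python sketch, not the paper's implementation. It assumes a one-dimensional image in which each object is an interval, and it replaces the LVLM with an idealized counter that reports every object overlapping a tile. The names `count_in_tiles`, `objects`, and `cuts` are hypothetical.

```python
def count_in_tiles(objects, cuts, width):
    """Sum per-tile counts; a tile counts every object that overlaps it.

    objects: list of (start, end) extents along one axis (hypothetical data)
    cuts:    positions at which the image is divided
    width:   total extent of the image along that axis
    """
    edges = [0, *sorted(cuts), width]
    total = 0
    for lo, hi in zip(edges, edges[1:]):
        # An object is visible in a tile if the two intervals overlap.
        total += sum(1 for (start, end) in objects if start < hi and end > lo)
    return total


# Three objects; the middle one straddles the midpoint of the image.
objects = [(5, 15), (45, 55), (80, 90)]

# Naive division: cut at the midpoint, straight through the middle object.
naive_total = count_in_tiles(objects, cuts=[50], width=100)

# Object-aware division: cut in the gap between the first two objects.
aware_total = count_in_tiles(objects, cuts=[30], width=100)

print(naive_total, aware_total)
```

Under these assumptions the naive cut counts the split object in both tiles, so the aggregated total exceeds the true count of three, while the cut placed in a gap between objects recovers it.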

Paper Structure

This paper contains 31 sections, 26 figures, 25 tables.

Figures (26)

  • Figure 1: Illustration of our proposed pipeline. First, an expression ($E$) describing the area of interest, such as "brown eggs", is extracted from the prompted question ($Q$). The expression is extracted using a large language model (LLM), which in our work is the same model as the LVLM. Then, $E$ and the image are provided as input to a grounding model, such as the one by liu2023grounding, to detect the area of interest. Second, any objects corresponding to $E$ are segmented. Third, in the object-aware division step, we use the segmentation masks to divide the detected area of interest without cutting through the objects of interest. Finally, the number of objects of interest in each sub-image is computed using an LVLM, and the results are aggregated.
  • Figure 2: Comparison of the naive and the object-aware division. The objects of interest are the circles. In \ref{fig:blue_circle_sub1}, we illustrate a naive division of the input image, which is divided into equally sized sub-images with straight lines. In \ref{fig:blue_circle_sub2}, we illustrate the object-aware division, which avoids cutting through circles. In \ref{fig:blue_circle_sub3}, we illustrate the counting error of GPT-4o for images with randomly positioned circles. The absolute counting error is the absolute difference between the ground truth and the number predicted by GPT-4o. The results are averaged over three trials.
  • Figure 3: Illustration of the area detection stage of LVLM-Count. For this image, $Q$ is set to "How many brown eggs are in the image". The LLM that is used in this step returns an $E$ which is "brown eggs". $E$ and the original image are given as input to GroundingDINO, which returns a bounding box. If the grounding model returns multiple bounding boxes, they are merged to form the final detected area.
  • Figure 4: Illustration of the target segmentation step of LVLM-Count. The goal is to produce all the instance masks for $E$ set to "brown egg". The cropped detected area from \ref{fig:area_detection}, together with $E$, is given as input to GroundingDINO, which produces the output shown in \ref{fig:bounding_boxes}. \ref{fig:bounding_boxes} is then given as input to SAM, which produces the output shown in \ref{fig:eggs}.
  • Figure 5: Illustration of the unsupervised and non-parametric method to obtain the division points $(P^1_s, P^1_e)$ and $(P^2_s, P^2_e)$. A few pixels are sampled (shown as points inside the segmented objects) from the pixels composing the target masks. The samples are projected onto the $x$-axis. The projected points are clustered using mean-shift clustering. The point in the middle of two consecutive clusters is considered a vertical division point. Blue lines are solely for illustration.
  • ...and 21 more figures
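
The division-point procedure described in the Figure 5 caption can be sketched roughly as follows. This is an illustrative sketch under stated assumptions, not the authors' code. The input is taken to be the $x$-coordinates of pixels already sampled from the target masks. The mean shift uses a flat kernel with a hand-picked `bandwidth`. The "middle of two consecutive clusters" is interpreted here as the midpoint between the rightmost sample of one cluster and the leftmost sample of the next, which is one possible reading of the caption. The function names `mean_shift_1d` and `vertical_division_points` are hypothetical.

```python
def mean_shift_1d(xs, bandwidth, tol=1e-6, max_iter=100):
    """Flat-kernel mean shift on scalars; returns one cluster label per input."""
    modes = []
    for x in xs:
        m = float(x)
        for _ in range(max_iter):
            # Shift the estimate to the mean of all samples within the window.
            window = [v for v in xs if abs(v - m) <= bandwidth]
            new_m = sum(window) / len(window)
            if abs(new_m - m) < tol:
                break
            m = new_m
        modes.append(m)

    # Samples whose modes lie within one bandwidth share a cluster label.
    centers, labels = [], []
    for m in modes:
        for i, c in enumerate(centers):
            if abs(m - c) <= bandwidth:
                labels.append(i)
                break
        else:
            centers.append(m)
            labels.append(len(centers) - 1)
    return labels


def vertical_division_points(sample_xs, bandwidth):
    """Return x-positions midway between consecutive clusters of samples."""
    labels = mean_shift_1d(sample_xs, bandwidth)
    clusters = {}
    for x, lab in zip(sample_xs, labels):
        clusters.setdefault(lab, []).append(x)
    # Represent each cluster by its (leftmost, rightmost) projected sample.
    spans = sorted((min(v), max(v)) for v in clusters.values())
    # Assumption: cut halfway across the gap between adjacent clusters.
    return [(left[1] + right[0]) / 2 for left, right in zip(spans, spans[1:])]


# Hypothetical projected samples forming three groups along the x-axis.
sample_xs = [10, 12, 14, 50, 52, 54, 90, 92]
points = vertical_division_points(sample_xs, bandwidth=10)
print(points)
```

Because mean shift does not need the number of clusters in advance, the number of vertical division points follows from the data. In this sketch the `bandwidth` value controls how aggressively nearby samples are grouped and would need tuning for real images.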