Can Vision-Language Models Count? A Synthetic Benchmark and Analysis of Attention-Based Interventions
Saurav Sengupta, Nazanin Moradinasab, Jiebei Liu, Donald E. Brown
TL;DR
This work probes why Vision-Language Models struggle with counting by introducing a synthetic benchmark and a diagnostic framework that varies prompts and visual properties. It combines open-source VLM evaluations with attention-reweighting interventions to causally modulate visual grounding at inference time, revealing that counting performance is hindered more by enumerative binding under cognitive load than by pure compositional reasoning. Key findings include a non-monotonic effect of prompt specificity, degraded counting as visual density and texture complexity increase, and the surprising efficacy of suppressing rather than amplifying attention, with effects that vary by architecture and layer. The study provides a practical, interpretable toolkit for diagnosing and potentially remediating counting failures in VLMs, and it motivates architecture-tailored attention mechanisms and richer evaluation frameworks for visual enumeration tasks.
Abstract
Recent research suggests that Vision-Language Models (VLMs) often rely on inherent biases learned during training when responding to queries about visual properties of images. These biases are exacerbated when VLMs are asked highly specific questions that require them to focus on particular areas of the image, as in counting tasks. We build upon this research by developing a synthetic benchmark dataset and evaluation framework to systematically determine how counting performance varies as image and prompt properties change. Using open-source VLMs, we then analyze how attention allocation changes with the input parameters (e.g., the number of objects in the image, object color, background color, object texture, background texture, and prompt specificity). We further implement attention-based interventions that modulate focus on visual tokens at different layers and evaluate their impact on counting performance across a range of visual conditions. Our experiments reveal that while counting remains challenging for VLMs, especially under high visual or linguistic complexity, certain attention interventions can lead to modest gains in counting performance.
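To make the idea of modulating focus on visual tokens concrete, the following is a minimal sketch of one possible reweighting scheme, not necessarily the intervention used in this work. It assumes a single row of post-softmax attention weights, a known set of visual-token positions, and a scaling factor `alpha`; the function name `reweight_visual_attention` and all parameter names are hypothetical. Values of `alpha` above 1 amplify attention on visual tokens, and values below 1 suppress it.

```python
from typing import List, Sequence


def reweight_visual_attention(
    weights: Sequence[float],
    visual_positions: Sequence[int],
    alpha: float,
) -> List[float]:
    """Scale post-softmax attention on visual tokens by alpha, then renormalize.

    Hypothetical helper for illustration only:
    alpha > 1 amplifies visual tokens; 0 <= alpha < 1 suppresses them.
    """
    if alpha < 0:
        raise ValueError("alpha must be non-negative")
    visual = set(visual_positions)
    # Multiply only the weights that sit on visual-token positions.
    scaled = [w * alpha if i in visual else w for i, w in enumerate(weights)]
    total = sum(scaled)
    if total == 0:
        raise ValueError("all attention mass was removed")
    # Renormalize so the row is again a probability distribution.
    return [w / total for w in scaled]


# Toy example: one query attending over 4 tokens; positions 1 and 2 are visual.
row = [0.4, 0.3, 0.2, 0.1]
amplified = reweight_visual_attention(row, visual_positions=[1, 2], alpha=2.0)
suppressed = reweight_visual_attention(row, visual_positions=[1, 2], alpha=0.5)
```

In a real model, a transformation of this kind would be applied per head and per layer inside the attention computation, which is where layer-specific effects could arise; the renormalization step keeps each attention row summing to one.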
