VisualOverload: Probing Visual Understanding of VLMs in Really Dense Scenes

Paul Gavrikov, Wei Lin, M. Jehanzeb Mirza, Soumya Jahagirdar, Muhammad Huzaifa, Sivan Doveh, Serena Yeung-Levy, James Glass, Hilde Kuehne

TL;DR

VisualOverload introduces a dense-scene, high-resolution VQA benchmark built from 150 public-domain paintings, totaling 2,720 QA pairs across six core tasks. The dataset emphasizes ground-truth privacy, manual annotation, and three difficulty levels, and it uses an evaluation server to fairly compare 37 VLMs. Across tasks, models struggle with counting, OCR, and fine-grained reasoning, even as scene-classification performance remains relatively strong; error analyses reveal counting, OCR, and logical-consistency failures and shortcut biases. The work highlights fundamental gaps in current vision-language models when faced with visually overloaded scenes and provides a valuable resource for measuring and guiding improvements in robust perception and reasoning capabilities.

Abstract

Is basic visual understanding really solved in state-of-the-art VLMs? We present VisualOverload, a slightly different visual question answering (VQA) benchmark comprising 2,720 question-answer pairs, with privately held ground-truth responses. Unlike prior VQA datasets that typically focus on near global image understanding, VisualOverload challenges models to perform simple, knowledge-free vision tasks in densely populated (or, overloaded) scenes. Our dataset consists of high-resolution scans of public-domain paintings that are populated with multiple figures, actions, and unfolding subplots set against elaborately detailed backdrops. We manually annotated these images with questions across six task categories to probe for a thorough understanding of the scene. We hypothesize that current benchmarks overestimate the performance of VLMs, and encoding and reasoning over details is still a challenging task for them, especially if they are confronted with densely populated scenes. Indeed, we observe that even the best model (o3) out of 37 tested models only achieves 19.6% accuracy on our hardest test split and overall 69.5% accuracy on all questions. Beyond a thorough evaluation, we complement our benchmark with an error analysis that reveals multiple failure modes, including a lack of counting skills, failure in OCR, and striking logical inconsistencies under complex tasks. Altogether, VisualOverload exposes a critical gap in current vision models and offers a crucial resource for the community to develop better models. Benchmark: http://paulgavrikov.github.io/visualoverload

Paper Structure

This paper contains 30 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Example questions from VisualOverload. Our benchmark consists of images displaying densely populated scenes paired with handcrafted questions (multiple-choice and free-form) covering six core vision tasks. Every yes/no question is paired with a logically opposite question, which lowers the random-chance baseline and provides an additional signal for measuring logical consistency.
  • Figure 2: Insights into counting errors. All analyses display distributions over all model predictions exclusively for the counting task.
  • Figure 3: OCR prediction error distance.
  • Figure 4: Logical consistency.
  • Figure 5: Resolution ablation.
  • ...and 1 more figure
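
The Figure 1 caption notes that each yes/no question is paired with its logical opposite. The sketch below illustrates one way such pairs could be scored. It is not the benchmark's actual scoring code: the ground truth is privately held and evaluated server-side, so the record layout and the function names `paired_accuracy` and `consistency_rate` are hypothetical. It assumes a pair counts as correct only when both questions are answered correctly, which would lower the random-guess baseline from 50% per question to 25% per pair.

```python
from collections import defaultdict


def _normalize(answer):
    """Lowercase and strip an answer string so 'Yes ' matches 'yes'."""
    return answer.strip().lower()


def paired_accuracy(records):
    """Fraction of pairs in which every question was answered correctly.

    records: iterable of (pair_id, predicted, expected) tuples.
    This layout is an assumption made for illustration.
    """
    pairs = defaultdict(list)
    for pair_id, predicted, expected in records:
        pairs[pair_id].append(_normalize(predicted) == _normalize(expected))
    if not pairs:
        return 0.0
    return sum(all(results) for results in pairs.values()) / len(pairs)


def consistency_rate(records):
    """Fraction of pairs whose two predictions differ.

    Logically opposite yes/no questions should get opposite answers,
    so identical predictions within a pair indicate an inconsistency,
    regardless of which of the two answers is actually correct.
    """
    answers = defaultdict(list)
    for pair_id, predicted, _expected in records:
        answers[pair_id].append(_normalize(predicted))
    if not answers:
        return 0.0
    consistent = sum(
        1 for preds in answers.values() if len(preds) == 2 and preds[0] != preds[1]
    )
    return consistent / len(answers)


# Hypothetical predictions for three question pairs.
records = [
    ("q1", "yes", "yes"), ("q1", "no", "no"),   # both correct, consistent
    ("q2", "yes", "yes"), ("q2", "yes", "no"),  # one wrong, inconsistent
    ("q3", "no", "yes"), ("q3", "yes", "no"),   # both wrong, yet consistent
]

print(paired_accuracy(records))
print(consistency_rate(records))
```

Keeping the two metrics separate distinguishes a model that is consistently wrong (pair `q3`) from one that contradicts itself (pair `q2`), which is the kind of signal the paired design is meant to expose.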