CAPTURe: Evaluating Spatial Reasoning in Vision Language Models via Occluded Object Counting

Atin Pothiraj, Elias Stengel-Eskin, Jaemin Cho, Mohit Bansal

TL;DR

CAPTURe is a benchmark that evaluates vision-language models (VLMs) on amodal counting in patterned scenes under occlusion, probing both pattern recognition and world modeling. The paper introduces the CAPTURe-real and CAPTURe-synthetic datasets and a text-extraction-based counting protocol, and evaluates GPT-4o, Intern-VL2, Molmo, Qwen2-VL, MiniCPM-o, and Kimi-VL-A3B against human and CountGD baselines. VLMs struggle to count occluded patterns, and occlusion generally increases their error. GPT-4o is the strongest VLM yet remains far below human performance, and a pure counting model outperforms the VLMs on occluded data. Supplying auxiliary information, such as exact object coordinates or inpainted versions of the occluded regions, substantially improves performance, which points to gaps in both visual world modeling and counting. The work suggests future directions for improving world-model reasoning in VLMs, including structured spatial cues and targeted pretraining.

Abstract

Recognizing and reasoning about occluded (partially or fully hidden) objects is vital to understanding visual scenes, as occlusions frequently occur in real-world environments and act as obstacles for spatial comprehension. To test models' ability to reason about multiple occluded objects, we introduce a novel task, Counting Amodally for Patterns Through Unseen REgions (CAPTURe), which requires a model to count objects arranged in a pattern by inferring how the pattern continues behind an occluder (an object which blocks parts of the scene). CAPTURe requires both recognizing visual patterns and reasoning, making it a useful testbed for evaluating vision-language models (VLMs) on whether they understand occluded patterns and possess spatial understanding skills. By requiring models to reason about occluded objects, CAPTURe also tests VLMs' ability to form world models that would allow them to fill in missing information. CAPTURe consists of two parts: (1) CAPTURe-real, with manually filtered images of real objects in patterns, and (2) CAPTURe-synthetic, a controlled diagnostic with generated patterned images. We evaluate four strong VLMs (GPT-4o, Intern-VL2, Molmo, and Qwen2-VL) on CAPTURe, finding that models struggle to count on both occluded and unoccluded patterns. Crucially, we find that models perform worse with occlusion, suggesting that VLMs are also deficient in inferring unseen spatial relationships: even the strongest VLMs like GPT-4o fail to count with occlusion. In contrast, we find that humans achieve very little error on CAPTURe. We also find that providing auxiliary information about occluded object locations increases performance, underscoring that the model error comes both from an inability to handle occlusion and from difficulty counting in images. Code and data: https://github.com/atinpothiraj/CAPTURe

Paper Structure

This paper contains 27 sections, 2 equations, 11 figures, and 8 tables.

Figures (11)

  • Figure 1: CAPTURe example with an output from GPT-4o. While people can easily infer the missing number of cups and correctly reason over occluded patterns, models generally struggle to reason over these occluded scenes.
  • Figure 2: Example images with GPT-4o responses to CAPTURe$^\text{real}$ and CAPTURe$^\text{synthetic}$ occluded splits.
  • Figure 3: Number of objects in CAPTURe$^\text{real}$ images.
  • Figure 4: Number of occluded objects in CAPTURe$^\text{synthetic}$ images.
  • Figure 5: VLM vs. VLM + CountGD hybrid on questions from the CAPTURe$^\text{real}$ (occluded split) that are not in CountGD training set. Metric: sMAPE (lower is better).
  • ...and 6 more figures
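
The Figure 5 caption reports sMAPE (symmetric mean absolute percentage error), where lower is better. This page does not reproduce the paper's equation, so the sketch below assumes one common bounded form of the metric, 100 · mean(|y − ŷ| / (|y| + |ŷ|)), which ranges from 0 to 100. The function name `smape` and the treatment of a 0/0 pair as zero error are illustrative choices, not taken from the paper or its code.

```python
def smape(targets, predictions):
    """Symmetric mean absolute percentage error, as a percentage in [0, 100].

    Assumed form: 100 * mean(|y - y_hat| / (|y| + |y_hat|)).
    This is a common definition, not necessarily the paper's exact equation.
    """
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have the same length")
    if len(targets) == 0:
        raise ValueError("at least one example is required")

    total = 0.0
    for y, y_hat in zip(targets, predictions):
        denom = abs(y) + abs(y_hat)
        # Assumption: a pair where both counts are zero contributes no error.
        total += abs(y - y_hat) / denom if denom else 0.0
    return 100.0 * total / len(targets)


# Hypothetical ground-truth and predicted object counts for three images.
true_counts = [12, 20, 9]
predicted_counts = [10, 20, 12]
print(f"sMAPE: {smape(true_counts, predicted_counts):.2f}%")
```

Because the denominator sums the magnitudes of both counts, each per-image term is capped at 1, so a single wildly wrong count cannot dominate the average the way it can with plain absolute error.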