A Study of Commonsense Reasoning over Visual Object Properties

Abhishek Kolari, Mohammadhossein Khojasteh, Yifan Jiang, Floris den Hengst, Filip Ilievski

TL;DR

This work addresses how to evaluate commonsense object property reasoning in visual scenes by introducing the OPTICS framework, which jointly considers four object-property dimensions, three levels of reasoning, and three image types. The framework is instantiated in two VQA benchmarks, OPTICS-CNT for counting and OPTICS-CMP for comparisons, totaling over 3,000 questions across 360 images, and twelve state-of-the-art VLMs are evaluated in zero-shot settings. The results reveal significant gaps relative to human performance, with counting accuracy below 40% even for the best-performing model and comparison accuracy around 58–70%, and highlight difficulties with photographic images and counterfactual scenarios. The authors provide data and code to enable scalable benchmarking and guide the development of advanced reasoning VLMs, with future work focusing on scalable question generation, robust guidelines, and richer reasoning architectures.

Abstract

Inspired by human categorization, object property reasoning involves identifying and recognizing low-level details and higher-level abstractions. While current visual question answering (VQA) studies consider multiple object properties, such as size, they typically blend perception and reasoning and lack representativeness in terms of reasoning and image categories, making it unclear whether and how vision-language models (VLMs) abstract and reason over depicted objects. To this end, we introduce a systematic evaluation framework comprising images of three representative types, three reasoning levels of increasing complexity, and four object property dimensions, informed by prior work on common sense. We develop a procedure to instantiate this framework in two VQA object reasoning benchmarks: OPTICS-CNT, comprising 360 images paired with 1,080 multi-level, count-based questions, and OPTICS-CMP, with 2.1k comparison questions. Experiments with 12 state-of-the-art VLMs in zero-shot settings reveal significant limitations relative to humans, with the best-performing model achieving below 40% counting and 70% comparison accuracy. VLMs struggle particularly with photographic images, counterfactual reasoning, physical and functional properties, and higher counts. We make the OPTICS benchmark data and code available to support future work on scalable benchmarking methods, generalized annotation guidelines, and advanced reasoning VLMs.

Paper Structure

This paper contains 29 sections, 16 figures, 10 tables.

Figures (16)

  • Figure 1: VQA in OPTICS-CNT, with questions about different object properties of varying reasoning complexity on three types of images from different domains.
  • Figure 2: OPTICS-CNT and OPTICS-CMP construction pipeline: image collection, candidate QA formulation, quality assurance, and pairwise combination. n and m are the numbers of images and QAs, respectively.
  • Figure 3: OPTICS-CMP question template.
  • Figure 4: OPTICS-CNT question composition.
  • Figure 5: Accuracy per count (0-10) for humans, who reach an average (micro) accuracy of 73.89%, and for the three best-performing models: Qwen2.5-VL-Instruct (32B), InternVL3 (14B), and Qwen2.5-VL-Instruct (7B).
  • ...and 11 more figures