Perceptual Taxonomy: Evaluating and Guiding Hierarchical Scene Reasoning in Vision-Language Models

Jonathan Lee, Xingrui Wang, Jiawei Peng, Luoxin Ye, Zehan Zheng, Tiezheng Zhang, Tao Wang, Wufei Ma, Siyi Chen, Yu-Cheng Chou, Prakhar Kaushik, Alan Yuille

TL;DR

This work introduces Perceptual Taxonomy (PercepTax), a benchmark and framework for evaluating vision-language models on physically grounded, hierarchical scene understanding. It defines a three-level representation (scene, objects, properties) and a taxonomy spanning material, affordance, function, and physical properties across 3,173 objects, annotated in 4,544 synthetic and 1,258 real images, yielding 28,033 template-based questions plus 50 expert questions. Evaluation across state-of-the-art closed- and open-source VLMs reveals strong object recognition but notable weaknesses in property-level and taxonomy reasoning, with performance gaps that widen in real-world images. An in-context, taxonomic reasoning paradigm using simulated exemplars improves real-image performance, demonstrating sim-to-real transfer and highlighting the need for structure-aware objectives to achieve robust physically grounded reasoning in VLMs.

Abstract

We propose Perceptual Taxonomy, a structured process of scene understanding that first recognizes objects and their spatial configurations, then infers task-relevant properties such as material, affordance, function, and physical attributes to support goal-directed reasoning. While this form of reasoning is fundamental to human cognition, current vision-language benchmarks lack comprehensive evaluation of this ability and instead focus on surface-level recognition or image-text alignment. To address this gap, we introduce Perceptual Taxonomy, a benchmark for physically grounded visual reasoning. We annotate 3,173 objects with four property families covering 84 fine-grained attributes. Using these annotations, we construct a multiple-choice question benchmark with 5,802 images across both synthetic and real domains. The benchmark contains 28,033 template-based questions spanning four types (object description, spatial reasoning, property matching, and taxonomy reasoning), along with 50 expert-crafted questions designed to evaluate models across the full spectrum of perceptual taxonomy reasoning. Experimental results show that leading vision-language models perform well on recognition tasks but degrade by 10 to 20 percent on property-driven questions, especially those requiring multi-step reasoning over structured attributes. These findings highlight a persistent gap in structured visual understanding and the limitations of current models that rely heavily on pattern matching. We also show that providing in-context reasoning examples from simulated scenes improves performance on real-world and expert-curated questions, demonstrating the effectiveness of perceptual-taxonomy-guided prompting.
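
As a rough illustration of the hierarchy described above (scene, then objects, then properties) and of how a template could turn annotations into a multiple-choice question, the following minimal Python sketch may help. It is not the authors' schema or generation code: the class names, the attribute labels, the question wording, and the rule of requiring exactly one matching object are all assumptions made for this example.

```python
import random
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set

# The four property families named in the abstract.
FAMILIES = ("material", "affordance", "function", "physical")


@dataclass
class SceneObject:
    name: str
    # Maps a property family to its attribute labels (labels are hypothetical).
    properties: Dict[str, Set[str]] = field(default_factory=dict)

    def has(self, family: str, attribute: str) -> bool:
        return attribute in self.properties.get(family, set())


@dataclass
class Scene:
    context: str                 # scene-level description
    objects: List[SceneObject]   # object level; properties live on each object


def property_matching_question(
    scene: Scene, family: str, attribute: str,
    rng: random.Random, n_options: int = 4,
) -> Optional[dict]:
    """Build one multiple-choice question, or None if the scene cannot support it."""
    matches = [o for o in scene.objects if o.has(family, attribute)]
    distractors = [o for o in scene.objects if not o.has(family, attribute)]
    # Assumed rule: exactly one correct object, so the answer is unambiguous.
    if len(matches) != 1 or len(distractors) < n_options - 1:
        return None
    options = [matches[0].name]
    options += [o.name for o in rng.sample(distractors, n_options - 1)]
    rng.shuffle(options)
    return {
        # Hypothetical template wording.
        "question": f"Which object in the {scene.context} has the "
                    f"{family} attribute '{attribute}'?",
        "options": options,
        "answer": matches[0].name,
    }


scene = Scene("office", [
    SceneObject("book", {"physical": {"flat", "rigid"}, "material": {"paper"}}),
    SceneObject("pillow", {"physical": {"soft"}, "material": {"fabric"}}),
    SceneObject("mug", {"physical": {"rigid"}, "material": {"ceramic"}}),
    SceneObject("curtain", {"physical": {"soft"}, "material": {"fabric"}}),
])
q = property_matching_question(scene, "physical", "flat", random.Random(0))
ambiguous = property_matching_question(scene, "physical", "rigid", random.Random(0))
```

Here `q` asks for the flat object and its answer is the book, while `ambiguous` is `None` because both the book and the mug are rigid. Rejecting ambiguous cases is one simple way to keep template-generated questions well-posed; the benchmark's actual filtering may differ.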

Paper Structure

This paper contains 33 sections, 9 figures, 4 tables.

Figures (9)

  • Figure 1: Illustration of our proposed Perceptual Taxonomy for structured scene understanding. (a) A motivating example of human reasoning: when needing protection, a person hierarchically perceives the scene (scene-level context), identifies available objects (object-level), infers their physically grounded properties (e.g., flat, rigid, portable), and selects the most suitable one for the goal (book as shield). (b) An example from PercepTax where a model is asked to select an object that can be repurposed as a shield. The model must reason over object properties to make a functional choice, emulating human-like perception and decision-making.
  • Figure 2: Overview of the PercepTax benchmark. (a) Each scene is annotated with object detections and their mapped attributes across four domains: material, physical properties, affordance, and function. (b) Summary statistics for 84 attribute categories and question distributions across synthetic and real images. (c) Example questions covering four reasoning types: object description, spatial relation, property matching, and taxonomy reasoning, together forming a unified framework for perceptual and hierarchical scene understanding.
  • Figure 3: Object Annotation Pipeline: Objects are first collected from synthetic and real scenes, described by foundation models, and clustered by shared material, function, and affordance attributes with human verification at each stage. Template-based QA generation then converts verified clusters into question–answer pairs, with 68.3% passing the final quality check to ensure accurate and semantically grounded annotations.
  • Figure 4: Qualitative examples illustrating model errors in hierarchical reasoning. (a) Instances of hallucinated or inconsistent reasoning on simulated-image benchmark questions produced by Qwen3 and GPT-5. (b) Examples of Gemini’s reasoning on real-image benchmark questions before and after in-context learning.
  • Figure 7: Overview of the real-image object annotation pipeline. The blue blocks are intermediate outputs, and the red blocks (object tags, bounding boxes, 3D poses) are the pipeline outputs used to create the object attribute clusters and benchmark questions.
  • ...and 4 more figures