Perceptual Taxonomy: Evaluating and Guiding Hierarchical Scene Reasoning in Vision-Language Models
Jonathan Lee, Xingrui Wang, Jiawei Peng, Luoxin Ye, Zehan Zheng, Tiezheng Zhang, Tao Wang, Wufei Ma, Siyi Chen, Yu-Cheng Chou, Prakhar Kaushik, Alan Yuille
TL;DR
This work introduces Perceptual Taxonomy (PercepTax), a benchmark and framework for evaluating vision-language models on physically grounded, hierarchical scene understanding. It defines a three-level representation (scene, objects, properties) and a taxonomy spanning material, affordance, function, and physical properties across 3,173 objects, annotated in 4,544 synthetic and 1,258 real images, yielding 28,033 template-based questions plus 50 expert questions. Evaluation across state-of-the-art closed- and open-source VLMs reveals strong object recognition but notable weaknesses in property-level and taxonomy reasoning, with performance gaps that widen in real-world images. An in-context taxonomic reasoning paradigm using simulated exemplars improves real-image performance, demonstrating sim-to-real transfer and highlighting the need for structure-aware objectives to achieve robust physically grounded reasoning in VLMs.
Abstract
We propose Perceptual Taxonomy, a structured process of scene understanding that first recognizes objects and their spatial configurations, then infers task-relevant properties such as material, affordance, function, and physical attributes to support goal-directed reasoning. While this form of reasoning is fundamental to human cognition, current vision-language benchmarks lack comprehensive evaluation of this ability and instead focus on surface-level recognition or image-text alignment. To address this gap, we introduce Perceptual Taxonomy, a benchmark for physically grounded visual reasoning. We annotate 3,173 objects with four property families covering 84 fine-grained attributes. Using these annotations, we construct a multiple-choice question benchmark with 5,802 images across both synthetic and real domains. The benchmark contains 28,033 template-based questions spanning four types (object description, spatial reasoning, property matching, and taxonomy reasoning), along with 50 expert-crafted questions designed to evaluate models across the full spectrum of perceptual taxonomy reasoning. Experimental results show that leading vision-language models perform well on recognition tasks but degrade by 10 to 20 percent on property-driven questions, especially those requiring multi-step reasoning over structured attributes. These findings highlight a persistent gap in structured visual understanding and the limitations of current models that rely heavily on pattern matching. We also show that providing in-context reasoning examples from simulated scenes improves performance on real-world and expert-curated questions, demonstrating the effectiveness of perceptual-taxonomy-guided prompting.
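To make the hierarchy described above concrete, the following is a minimal Python sketch of a scene, its objects, and their properties grouped into the four property families, together with one way a template-based property-matching question could be generated from such annotations. This is an illustration under stated assumptions, not the benchmark's actual schema or question generator: the class names, the function names, the question wording, and the attribute labels (for example "ceramic" or "graspable") are all hypothetical.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# The four property families named in the abstract.
PROPERTY_FAMILIES = ("material", "affordance", "function", "physical")


@dataclass
class ObjectNode:
    """One object in a scene (hypothetical schema)."""
    name: str
    # Maps a property family to a set of attribute labels.
    # The labels used in this sketch are illustrative only.
    properties: dict[str, set[str]] = field(default_factory=dict)


@dataclass
class Scene:
    """Top level of the scene -> objects -> properties hierarchy."""
    description: str
    objects: list[ObjectNode]


def objects_with(scene: Scene, family: str, attribute: str) -> list[str]:
    """Return names of objects carrying `attribute` under `family`."""
    if family not in PROPERTY_FAMILIES:
        raise ValueError(f"unknown property family: {family}")
    return [
        obj.name
        for obj in scene.objects
        if attribute in obj.properties.get(family, set())
    ]


def property_matching_question(scene: Scene, family: str, attribute: str) -> dict:
    """Fill a hypothetical template for a multiple-choice question.

    Requires exactly one matching object so the answer is unambiguous.
    """
    matches = objects_with(scene, family, attribute)
    if len(matches) != 1:
        raise ValueError("template needs exactly one matching object")
    return {
        "question": f"Which object has the {family} property '{attribute}'?",
        "options": [obj.name for obj in scene.objects],
        "answer": matches[0],
    }


def example_scene() -> Scene:
    """A toy scene with made-up annotations."""
    return Scene(
        description="kitchen counter",
        objects=[
            ObjectNode("mug", {"material": {"ceramic"},
                               "affordance": {"graspable", "containable"}}),
            ObjectNode("knife", {"material": {"metal"},
                                 "affordance": {"graspable", "cuttable"}}),
            ObjectNode("sponge", {"material": {"foam"},
                                  "affordance": {"squeezable"}}),
        ],
    )
```

Calling property_matching_question(example_scene(), "material", "ceramic") yields a question whose options are the three object names and whose answer is "mug". Keeping the attributes grouped by family is a design choice of this sketch; it lets one template be reused across material, affordance, function, and physical queries.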
