Probing Perceptual Constancy in Large Vision-Language Models
Haoran Sun, Bingyang Wang, Suyang Yu, Yijiang Li, Qingying Gao, Haiyun Lyu, Hokin Deng, Dezhi Luo
TL;DR
This work tackles perceptual constancy in Vision-Language Models by introducing ConstancyBench, a large-scale benchmark spanning color, size, and shape constancy with 236 controlled experiments across 155 VLMs. The study employs zero-shot prompts and analyzes results with domain-wise statistics, model-size scaling, and a two-parameter logistic IRT framework to reveal the latent structure of constancy tasks. Findings show a clear hierarchy: shape constancy is robust even in smaller models, while color and size constancy improve with greater model capacity and multimodal integration, following a quantifiable scaling pattern. The results highlight a hierarchical, emergent pattern of perceptual invariance and propose ConstancyBench as a diagnostic tool for assessing world-modeling fidelity and guiding future architectural and training biases toward robust real-world perception.
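For readers unfamiliar with the two-parameter logistic (2PL) IRT model mentioned above, the following is a minimal sketch of its standard response function: the probability that a model with latent ability theta passes an item with discrimination a and difficulty b. The function name `p_correct` and the item parameter values are hypothetical and chosen only for illustration. They are not taken from the paper, and the paper's actual fitting procedure is not shown here.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """Standard 2PL response function.

    theta: latent ability of the test-taker (here, a VLM)
    a:     item discrimination (how sharply the item separates abilities)
    b:     item difficulty (ability at which the pass probability is 0.5)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical item parameters (illustrative only, not values from the paper):
# an easier item has a lower difficulty b than a harder one.
items = {
    "easier_item": (1.2, -1.0),
    "harder_item": (1.2, 0.5),
}

for name, (a, b) in items.items():
    # Pass probability for a model of average ability (theta = 0).
    print(name, round(p_correct(0.0, a, b), 3))
```

Under this model, a domain whose items have low difficulty is passed by low-ability models, while a domain with higher-difficulty items is passed only as ability increases.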
Abstract
Perceptual constancy is the ability to maintain stable perceptions of objects despite changes in sensory input, such as variations in distance, angle, or lighting. This ability is crucial for visual understanding in a dynamic world. Here, we examine this ability in current Vision-Language Models (VLMs). We evaluated 155 VLMs on 236 experiments across three domains: color, size, and shape constancy. The experiments included single-image and video adaptations of classic cognitive tasks, along with novel tasks in in-the-wild conditions. We found significant variability in VLM performance across these domains, with performance on shape constancy clearly dissociated from performance on color and size constancy.
