Probing Perceptual Constancy in Large Vision-Language Models

Haoran Sun, Bingyang Wang, Suyang Yu, Yijiang Li, Qingying Gao, Haiyun Lyu, Hokin Deng, Dezhi Luo

TL;DR

This work tackles perceptual constancy in Vision-Language Models by introducing ConstancyBench, a large-scale benchmark spanning color, size, and shape constancy with 236 controlled experiments across 155 VLMs. The study employs zero-shot prompts and analyzes results with domain-wise statistics, model-size scaling, and a two-parameter logistic IRT framework to reveal the latent structure of constancy tasks. Findings show a clear hierarchy: shape constancy is robust even in smaller models, while color and size constancy improve with greater model capacity and multimodal integration, following a quantifiable scaling pattern. The results point to a hierarchical, emergent pattern of perceptual invariance, and the authors propose ConstancyBench as a diagnostic tool for assessing world-modeling fidelity and for guiding future architectural and training choices toward robust real-world perception.

Abstract

Perceptual constancy is the ability to maintain stable perceptions of objects despite changes in sensory input, such as variations in distance, angle, or lighting. This ability is crucial for visual understanding in a dynamic world. Here, we explore this ability in current Vision-Language Models (VLMs). We evaluated 155 VLMs on 236 experiments across three domains: color, size, and shape constancy. The experiments included single-image and video adaptations of classic cognitive tasks, along with novel tasks in in-the-wild conditions. We found significant variability in VLM performance across these domains, with performance in shape constancy clearly dissociated from that in color and size constancy.

Paper Structure

This paper contains 19 sections, 9 figures.

Figures (9)

  • Figure 1: Sample Tasks from the Three Evaluation Dimensions of ConstancyBench. Example model performance from GPT-4o is presented.
  • Figure 2: Bar plots show accuracy scores across vision-language models for color ($n=153$), shape ($n=152$), and size ($n=149$) constancy tasks after outlier removal (Color: $0.588 \pm 0.185$; Shape: $0.723 \pm 0.170$; Size: $0.584 \pm 0.123$). Horizontal bars with asterisks indicate statistical significance from post-hoc Tukey HSD tests (* $p < 0.001$; ns = not significant, $p > 0.05$).
  • Figure 3: Relationship between model size and performance. Larger models tend to perform better in perceptual constancy tasks.
  • Figure 4: Item Response Theory (IRT) analysis of model-level performance. Each point represents a single item parameterized by discrimination ($a$) and difficulty ($b$). Shape-constancy items cluster in regions of lower difficulty and moderate discrimination, whereas color and size items show wider dispersion, reflecting greater variability in perceptual demands and model sensitivity.
  • Figure 7: Additional Sample Tasks: Color Constancy.
  • ...and 4 more figures
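
The Figure 4 caption describes each item by a discrimination ($a$) and a difficulty ($b$) under a two-parameter logistic IRT model. As a minimal sketch, assuming the standard textbook form of the 2PL item response function (the paper's exact parameterization and fitting procedure are not given here and may differ), the probability that a model with latent ability theta answers an item correctly can be computed as follows. The function name and the item parameters are hypothetical, chosen only for illustration.

```python
import math


def p_correct(theta: float, a: float, b: float) -> float:
    """Standard 2PL item response function (assumed form):
    P(correct | theta) = 1 / (1 + exp(-a * (theta - b)))

    theta: latent ability of the model being evaluated
    a:     item discrimination (how sharply the item separates abilities)
    b:     item difficulty (ability at which P(correct) = 0.5)
    """
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))


# Hypothetical item parameters, not taken from the paper.
easy_item = {"a": 1.0, "b": -1.0}   # low difficulty, moderate discrimination
hard_item = {"a": 2.0, "b": 1.0}    # higher difficulty, sharper discrimination

if __name__ == "__main__":
    for theta in (-1.0, 0.0, 1.0):
        pe = p_correct(theta, **easy_item)
        ph = p_correct(theta, **hard_item)
        print(f"theta={theta:+.1f}  easy={pe:.3f}  hard={ph:.3f}")
```

Under this form, an item with low $b$ is answered correctly even by low-ability models, which is one way to read the caption's statement that shape-constancy items cluster at lower difficulty.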