Evaluating Model Perception of Color Illusions in Photorealistic Scenes

Lingjun Mao, Zineng Tang, Alane Suhr

TL;DR

This work introduces RCID, a large-scale Realistic Color Illusion Dataset, to systematically evaluate color-illusion perception in vision-language systems. It presents an automated generation pipeline using ControlNet diffusion and procedural synthesis to create 19,000 photorealistic illusion images across contrast, stripe, and filter types, with human-validated labels and QA prompts. Through extensive experiments on open-source VLMs, the study demonstrates that models exhibit human-like perceptual biases on illusion content, influenced by prompting, fine-tuning, model size, and prior knowledge such as language and commonsense. The findings reveal the dual influence of the visual system and prior knowledge on VLMs and provide a practical baseline and guidelines for illusion-aware evaluation, with implications for safety and reliability in color-aware tasks.

Abstract

We study the perception of color illusions by vision-language models. Color illusions, in which a person's visual system perceives a color differently from the actual color, are well studied in human vision. However, it remains underexplored whether vision-language models (VLMs), trained on large-scale human data, exhibit similar perceptual biases when confronted with such illusions. We propose an automated framework for generating color illusion images, resulting in RCID (Realistic Color Illusion Dataset), a dataset of 19,000 realistic illusion images. Our experiments show that all studied VLMs exhibit perceptual biases similar to those of human vision. Finally, we train a model to distinguish both human perception and actual pixel differences.
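
To make the gap between perceived color and actual pixel values concrete, the sketch below builds a classic simultaneous-contrast stimulus: two identical gray patches placed on a dark and a light background. It is only an illustration of the general idea and is not the generation pipeline used for RCID; the function names (`make_contrast_pair`, `center_value`) and all gray levels are assumptions chosen for this example.

```python
def make_contrast_pair(size=64, patch=16, patch_gray=128, dark_bg=40, light_bg=215):
    """Return two grayscale images (lists of rows) whose central patches are identical.

    All parameter values are illustrative assumptions, not values from the paper.
    """
    def one_image(background):
        # Fill the canvas with the background level.
        img = [[background] * size for _ in range(size)]
        # Paint a centered square patch with the shared gray level.
        start = (size - patch) // 2
        for y in range(start, start + patch):
            for x in range(start, start + patch):
                img[y][x] = patch_gray
        return img

    return one_image(dark_bg), one_image(light_bg)


def center_value(img):
    """Read the pixel value at the center of the image."""
    mid = len(img) // 2
    return img[mid][mid]


dark, light = make_contrast_pair()

# The two patches are pixel-identical, even though human observers typically
# report the patch on the dark background as looking lighter.
same_pixels = center_value(dark) == center_value(light)
```

A pixel-based question about this pair ("are the two patches the same color?") has the answer "yes", while a perception-based question would typically be answered "no" by a human observer.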

Paper Structure

This paper contains 36 sections, 5 equations, 18 figures, 1 table.

Figures (18)

  • Figure 2: Process for generating our dataset.
  • Figure 3: Data statistics of RCID (Realistic Color Illusion Dataset).
  • Figure 4: Proportion of model responses across three types of illusions (Contrast, Filter, and Stripe) on our development set. For non-illusion images, we report the proportions of "Accurate" and "Wrong" responses. For illusion images, we categorize responses as "No Illusion" (consistent with pixel values), "Human Like," and "N/A." Each image is evaluated using two types of prompts: one based on pixel values and the other based on human perception.
  • Figure 5: Deception rates of humans and VLMs across different structural patterns.
  • Figure 6: Proportions of 'No Illusion,' 'Human Like,' and 'N/A' responses for OFA models of different sizes on contrast illusion images.
  • ...and 13 more figures
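
The response categories named in the Figure 4 and Figure 6 captions can be tallied with a few lines of code. The following is a minimal sketch, assuming each model response on an illusion image is compared by exact string match against a pixel-consistent answer and a human-perception answer; the matching rule, the `categorize` and `proportions` helpers, and the sample records are hypothetical and are not taken from the paper.

```python
from collections import Counter

CATEGORIES = ("No Illusion", "Human Like", "N/A")


def categorize(response, pixel_answer, human_answer):
    """Label one response on an illusion image.

    Assumption: exact, case-insensitive string matching. The paper's actual
    scoring rule may differ.
    """
    normalized = response.strip().lower()
    if normalized == pixel_answer.lower():
        return "No Illusion"   # consistent with the pixel values
    if normalized == human_answer.lower():
        return "Human Like"    # consistent with human perception
    return "N/A"               # anything else


def proportions(labels):
    """Return the fraction of labels falling in each category."""
    counts = Counter(labels)
    total = len(labels)
    if total == 0:
        return {name: 0.0 for name in CATEGORIES}
    return {name: counts[name] / total for name in CATEGORIES}


# Hypothetical records: (model response, pixel-consistent answer, human-like answer).
records = [
    ("same", "same", "different"),
    ("Different", "same", "different"),
    ("different", "same", "different"),
    ("unsure", "same", "different"),
]

labels = [categorize(r, p, h) for r, p, h in records]
result = proportions(labels)
```

With these four made-up records, one response matches the pixel values, two match human perception, and one matches neither.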