The Illusion-Illusion: Vision Language Models See Illusions Where There are None

Tomer Ullman

TL;DR

The paper investigates whether current vision-language models make illusion-like perceptual errors by showing them illusion-illusion stimuli (images that closely resemble well-known illusions but contain no illusory effect, so humans perceive them veridically) alongside nearby controls. It covers 10 illusion types, each with an illusion-illusion variant and a control, and evaluates multiple multimodal models with a binary score that records whether a response aligns with human perception. No model reproduces human-like performance, and even leading models often misclassify illusion-illusions as real illusions, which points to a gap in cross-modal perception and interpretation. The work emphasizes the limitations of current evaluations and argues for more robust, nuanced diagnostics, with potential extensions to other modalities.
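The binary scoring described above can be illustrated with a short sketch. This is not the paper's evaluation code: the record layout, the condition labels (`illusion`, `illusion_illusion`, `control`), the model name, and the function `accuracy_by_condition` are all assumptions made for illustration, and the toy records are invented rather than taken from the results.

```python
from collections import defaultdict


def accuracy_by_condition(records):
    """Aggregate binary scores into a per-(model, condition) accuracy.

    Each record is a (model, condition, matches_human) tuple, where
    matches_human is True when the response aligns with human perception.
    This layout is a hypothetical one, not the paper's data format.
    """
    totals = defaultdict(int)
    hits = defaultdict(int)
    for model, condition, matches_human in records:
        key = (model, condition)
        totals[key] += 1
        hits[key] += int(matches_human)  # binary score: 1 if aligned, else 0
    return {key: hits[key] / totals[key] for key in totals}


# Invented toy records, for illustration only (not results from the paper).
toy_records = [
    ("model_a", "illusion", True),
    ("model_a", "illusion_illusion", False),
    ("model_a", "illusion_illusion", True),
    ("model_a", "control", True),
]

scores = accuracy_by_condition(toy_records)
for (model, condition), accuracy in sorted(scores.items()):
    print(f"{model:8s} {condition:18s} {accuracy:.2f}")
```

Keeping the three conditions separate matters in a setup like this: a model that labels every image an illusion would score well on the true illusions while failing the illusion-illusions and controls, and a single pooled accuracy would hide that pattern.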

Abstract

Illusions are entertaining, but they are also a useful diagnostic tool in cognitive science, philosophy, and neuroscience. A typical illusion shows a gap between how something "really is" and how something "appears to be", and this gap helps us understand the mental processing that leads to how something appears to be. Illusions are also useful for investigating artificial systems, and much research has examined whether computational models of perception fall prey to the same illusions as people. Here, I invert the standard use of perceptual illusions to examine basic processing errors in current vision language models. I present these models with illusory-illusions, neighbors of common illusions that should not elicit processing errors. These include such things as perfectly reasonable ducks, crooked lines that truly are crooked, circles that seem to have different sizes because they are, in fact, of different sizes, and so on. I show that many current vision language systems mistakenly see these illusion-illusions as illusions. I suggest that such failures are part of broader failures already discussed in the literature.

Paper Structure

This paper contains 5 sections, 5 figures.

Figures (5)

  • Figure 1: Examples of illusions paired with illusion-illusions, and actual input-output pairs of several current models.
  • Figure 2: The stimuli used in the experiments, together with their base prompt.
  • Figure 3: CAPTION
  • Figure 4: Results of evaluations. The top panel shows the results of model runs on base prompts. The bottom panel shows results for amended prompts.
  • Figure 5: Examples of failures on control images, with models framing them as illusions.