The Art of Deception: Color Visual Illusions and Diffusion Models

Alex Gomez-Villa, Kai Wang, Alejandro C. Parraga, Bartlomiej Twardowski, Jesus Malo, Javier Vazquez-Corral, Joost van de Weijer

TL;DR

This work investigates why visual illusions (VIs) arise in both humans and diffusion models by studying DDIM inversion trajectories, and shows that intermediate latents undergo human-like brightness and color shifts. It develops a region-targeted VI generation pipeline for text-to-image diffusion models, guided by a perceptual loss and a region-compatibility term, and validates the approach on VI datasets and in psychophysical experiments. Key contributions include (i) empirical replication of brightness/color illusions in diffusion models across VI datasets, (ii) a method to generate novel VIs within natural images with region-specific control, and (iii) psychophysical confirmation that model-generated illusions can fool human observers, outperforming classical baselines. The results suggest that diffusion processes encode perceptual statistics akin to those of human vision, with practical implications for perceptually informed image editing and more robust, human-aligned vision-language systems.

Abstract

Visual illusions in humans arise when interpreting out-of-distribution stimuli: if the observer is adapted to certain statistics, perception of outliers deviates from reality. Recent studies have shown that artificial neural networks (ANNs) can also be deceived by visual illusions. This revelation raises profound questions about the nature of visual information. Why are two independent systems, both human brains and ANNs, susceptible to the same illusions? Should any ANN be capable of perceiving visual illusions? Are these perceptions a feature or a flaw? In this work, we study how visual illusions are encoded in diffusion models. Remarkably, we show that they present human-like brightness/color shifts in their latent space. We use this fact to demonstrate that diffusion models can predict visual illusions. Furthermore, we also show how to generate new unseen visual illusions in realistic images using text-to-image diffusion models. We validate this ability through psychophysical experiments that show how our model-generated illusions also fool humans.

Paper Structure

This paper contains 22 sections, 5 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Observation: Denoising Diffusion Implicit Models (DDIM) have human-like visual illusions. Application: text-to-image generated visual illusions. Examples (a) and (b) show how the responses to physically equal patches shift differently along the path to the latent space. In contrast, (c) shows how physically different stimuli become more similar along the path, achieving color constancy. Observing this human-like behavior allows us to propose using text-to-image models to generate images in which physically identical patches are perceived differently (examples d, e, f). We highly recommend viewing the illusions on a computer screen.
  • Figure 2: DDIM inversion of the Brightness Contrast illusion [bruke] using Stable Diffusion. Top row: image-space visualization (decoded latents) showing (left) the original illusion, with two identical gray squares (marked in red) against black and white backgrounds, followed by inversion results using 3, 10, and 20 steps. Bottom row: histograms of the corresponding latent representations. The model gradually reproduces the perceptual difference in brightness between the physically identical squares in a (not fully Gaussian) intermediate representation.
  • Figure 3: Overview of our visual illusion generation pipeline. The process modifies noisy latent representations, $z_t$, through a custom loss function, $\mathcal{L}$, that guides the generation toward perceptually ambiguous outputs.
  • Figure 4: Qualitative replication of visual illusions (please enlarge display): For each item, the original image is on the left and its DDIM inversion on the right. The observation made in Fig. 1 is consistently reproduced. We zoom in on the region containing the illusion in each image pair. a) Barutan-seijin [barutan], b) Robot [robot], c) Shiosai [shiosai], d) Hermann grid [hermann1870erscheinung], e) Grating induction [mccourt1994grating], f) Bright room [natural], g) Confetti illusion [confetti]. For more qualitative examples and results on classical visual illusions, see the supplementary material.
  • Figure 5: Qualitative results of visual illusion generation (viewing at larger scale recommended). The first row presents prior work by (a) Gomez-Villa et al. [gomez2022synthesis], (b) Hirsch et al. [hirsch2020color], and (c) Roy et al. [roy2024bri3l]. The second and third rows display results from our method (text prompts are available in the supplementary material). Images (d) through (i) use identical target colors (shown at the left of each image). In images (j) and (k), we employ grayscale gradients as targets. The selected target regions are highlighted in red in the thumbnail images.
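
The DDIM inversion mentioned in the captions of Figures 2 and 4 can be illustrated with a minimal sketch. It uses the standard deterministic DDIM update run toward higher noise, applied to a flat list of values rather than real Stable Diffusion latents. The function names, the zero-output `toy_eps_model`, and the three-value `alpha_bars` schedule are hypothetical stand-ins chosen for illustration; they are not the paper's code or a real noise predictor.

```python
import math


def ddim_inversion_step(x_t, eps, alpha_bar_t, alpha_bar_next):
    """One deterministic DDIM inversion step (toward a noisier latent)."""
    # Clean-sample estimate implied by the current latent and noise estimate.
    x0_pred = [
        (x - math.sqrt(1.0 - alpha_bar_t) * e) / math.sqrt(alpha_bar_t)
        for x, e in zip(x_t, eps)
    ]
    # Re-noise that estimate to the next timestep, reusing the same eps.
    return [
        math.sqrt(alpha_bar_next) * x0 + math.sqrt(1.0 - alpha_bar_next) * e
        for x0, e in zip(x0_pred, eps)
    ]


def ddim_invert(x0, eps_model, alpha_bars):
    """Invert x0 along a decreasing alpha_bar schedule; return every latent."""
    trajectory = [list(x0)]
    x = list(x0)
    for t in range(len(alpha_bars) - 1):
        eps = eps_model(x, t)
        x = ddim_inversion_step(x, eps, alpha_bars[t], alpha_bars[t + 1])
        trajectory.append(x)
    return trajectory


def toy_eps_model(x, t):
    # Hypothetical placeholder for a trained noise predictor: always zero.
    return [0.0 for _ in x]


# Assumed toy schedule: alpha_bar decreases as the noise level increases.
alpha_bars = [1.0, 0.64, 0.25]
trajectory = ddim_invert([1.0, -2.0], toy_eps_model, alpha_bars)
```

Keeping the whole `trajectory`, rather than only the final latent, mirrors the idea of inspecting intermediate representations after a small number of inversion steps. With a real noise predictor, comparing the values of two physically identical patches at each entry would be one way to look for the brightness shift the captions describe.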