
Visual Anagrams: Generating Multi-View Optical Illusions with Diffusion Models

Daniel Geng, Inbum Park, Andrew Owens

TL;DR

This work introduces a zero-shot framework for generating multi-view optical illusions with off-the-shelf diffusion models: it denoises multiple transformed views of an image in parallel and aggregates their noise estimates into a single reverse-diffusion update. It formalizes the concept of visual anagrams and extends from rotations and flips to arbitrary pixel permutations and, more generally, orthogonal transformations, supported by a theoretical analysis of admissible views. The approach uses a pixel-based diffusion model to avoid the artifacts that latent diffusion models produce when views are applied to latent codes, and it demonstrates quantitative gains in alignment and concealment over baselines, along with extensive qualitative results and ablations across up to four views. The work also provides practical guidance on view design, highlights failure modes, and lays a foundation for broader exploration of diffusion-based perceptual illusions in a zero-shot regime.

Abstract

We address the problem of synthesizing multi-view optical illusions: images that change appearance upon a transformation, such as a flip or rotation. We propose a simple, zero-shot method for obtaining these illusions from off-the-shelf text-to-image diffusion models. During the reverse diffusion process, we estimate the noise from different views of a noisy image, and then combine these noise estimates and denoise the image. A theoretical analysis suggests that this method works precisely for views that can be written as orthogonal transformations, of which permutations are a subset. This leads to the idea of a visual anagram: an image that changes appearance under some rearrangement of pixels. This includes rotations and flips, but also more exotic pixel permutations such as a jigsaw rearrangement. Our approach also naturally extends to illusions with more than two views. We provide both qualitative and quantitative results demonstrating the effectiveness and flexibility of our method. Please see our project webpage for additional visualizations and results: https://dangeng.github.io/visual_anagrams/
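The claim that admissible views are orthogonal transformations, with permutations as a subset, can be checked numerically on small images. The following is a minimal sketch, not the authors' code: the helper names (`view_matrix`, `is_orthogonal`, `permutation_view`) and the brightness-halving counterexample are illustrative assumptions. It builds the matrix of a linear view by applying the view to each basis image and then tests whether that matrix satisfies A^T A = I.

```python
import numpy as np

def rot90_view(img):
    """Rotate an (H, W, C) image by 90 degrees in the image plane."""
    return np.rot90(img, k=1, axes=(0, 1))

def permutation_view(img, perm):
    """Rearrange the H*W pixel positions of an (H, W, C) image according to perm."""
    h, w, c = img.shape
    return img.reshape(h * w, c)[perm].reshape(h, w, c)

def view_matrix(view, shape):
    """Matrix of a linear view, built column by column from basis images."""
    n = int(np.prod(shape))
    cols = [view(e.reshape(shape)).reshape(-1) for e in np.eye(n)]
    return np.stack(cols, axis=1)

def is_orthogonal(mat, tol=1e-8):
    """True if mat^T mat equals the identity up to tol."""
    return np.allclose(mat.T @ mat, np.eye(mat.shape[1]), atol=tol)

shape = (4, 4, 1)  # tiny square image so the full matrix stays small
rng = np.random.default_rng(0)
perm = rng.permutation(shape[0] * shape[1])  # hypothetical "jigsaw-like" shuffle

print(is_orthogonal(view_matrix(rot90_view, shape)))
print(is_orthogonal(view_matrix(lambda x: permutation_view(x, perm), shape)))
# Illustrative counterexample (an assumption, not from the paper):
# halving brightness is linear but not orthogonal.
print(is_orthogonal(view_matrix(lambda x: 0.5 * x, shape)))
```

Rotations and pixel shuffles only move values around, so their matrices are permutation matrices and pass the check; a rescaling changes the norm of the image and fails it.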

Paper Structure

This paper contains 51 sections, 12 equations, 18 figures, and 2 tables.

Figures (18)

  • Figure 1: Generating Multi-View Illusions. We propose a method for generating optical illusions from an off-the-shelf text-to-image diffusion model. We create images that match different prompts after undergoing a transformation. Our approach supports a variety of transformations, including flips, rotations, skews, color inversions, and jigsaw rearrangements. All images are hand-selected. For random samples, please see \ref{fig:random} and Appendix \ref{sec:apdx_random}. For easier viewing, please see our project webpage (https://dangeng.github.io/visual_anagrams/) for animated versions of these illusions.
  • Figure 2: Algorithm Overview. Our method works by simultaneously denoising multiple views of an image. Given a noisy image ${\mathbf x}_t$, we compute noise estimates, $\epsilon_t^i$, conditioned on different prompts, after applying views $v_i$. We then apply the inverse view $v_i^{-1}$ to align estimates, average the estimates, and perform a reverse diffusion step. The final output is an optical illusion.
  • Figure 3: Latent-Based Artifacts. Manipulating the location of latent codes does not change the orientation of the blocks that they encode. Therefore, when using latent diffusion models we see artifacts as shown above, in which straight lines are thatched under a rotation.
  • Figure 4: Flip View CLIP Score Distribution. We visualize trade-offs between flipped and unflipped views by plotting the distribution of CLIP scores on the datasets. Note that the quality of the flipped image is as good as the unflipped image, with parity indicated by the dashed line.
  • Figure 5: Qualitative Comparisons. We compare illusions generated by baselines to our illusions. We show examples from both our prompt dataset and the CIFAR prompt dataset.
  • ...and 13 more figures
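
The procedure described in the Figure 2 caption (apply each view $v_i$, estimate noise under the matching prompt, apply $v_i^{-1}$, then average) can be sketched as follows. This is a hedged illustration rather than the authors' implementation: the `View` class, `combined_noise_estimate`, and the `echo_predictor` stand-in for the diffusion model's noise predictor are all hypothetical names introduced here, and the reverse diffusion step that would consume the averaged estimate is left to whichever sampler is in use.

```python
import numpy as np

class View:
    """A view v_i paired with its inverse v_i^{-1} (hypothetical helper)."""
    def __init__(self, forward, inverse):
        self.forward = forward
        self.inverse = inverse

identity = View(lambda x: x, lambda x: x)
flip = View(lambda x: x[::-1], lambda x: x[::-1])  # a vertical flip is its own inverse
rot90 = View(lambda x: np.rot90(x, 1), lambda x: np.rot90(x, -1))

def combined_noise_estimate(x_t, t, views, prompts, predict_noise):
    """Average per-view noise estimates after aligning them with the inverse views.

    predict_noise(x, t, prompt) is an assumed interface standing in for a
    text-conditioned diffusion model; it must return an array shaped like x.
    """
    aligned = []
    for view, prompt in zip(views, prompts):
        eps_i = predict_noise(view.forward(x_t), t, prompt)  # estimate in view i
        aligned.append(view.inverse(eps_i))                  # map back to the base frame
    return np.mean(aligned, axis=0)

def echo_predictor(x, t, prompt):
    """Toy stand-in for a real model: returns its input unchanged."""
    return x

rng = np.random.default_rng(0)
x_t = rng.standard_normal((8, 8, 3))  # noisy square image, (H, W, C)
eps = combined_noise_estimate(
    x_t, t=500,
    views=[identity, rot90],
    prompts=["prompt for view 1", "prompt for view 2"],  # placeholder prompts
    predict_noise=echo_predictor,
)
print(eps.shape)
```

With the echo stand-in, every aligned estimate equals `x_t`, which makes it easy to confirm that each inverse view undoes its forward view before a real model is plugged in.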