Making Images from Images: Interleaving Denoising and Transformation

Shumeet Baluja, David Marwood, Ashwin Baluja

TL;DR

This method extends and improves recent work in the generation of optical illusions by simultaneously learning not only the content of the images, but also the parameterized transformations required to transform the desired images into each other.

Abstract

Simply by rearranging the regions of an image, we can create a new image of any subject matter. The definition of regions is user-definable, ranging from regularly and irregularly shaped blocks to concentric rings or even individual pixels. Our method extends and improves recent work in the generation of optical illusions by simultaneously learning not only the content of the images, but also the parameterized transformations required to transform the desired images into each other. By learning the image transforms, we allow any source image to be pre-specified; any existing image (e.g., the Mona Lisa) can be transformed to a novel subject. We formulate this process as a constrained optimization problem and address it by interleaving the steps of image diffusion with an energy minimization step. Unlike previous methods, increasing the number of regions actually makes the problem easier and improves results. We demonstrate our approach in both pixel and latent spaces. Creative extensions, such as using infinite copies of the source image and employing multiple source images, are also given.

Paper Structure

This paper contains 11 sections, 3 equations, 17 figures.

Figures (17)

  • Figure 1: Classic examples of optical illusions. (A) G.Arcimboldo's Fruit Basket (1590) that shows a face when upright, and a fruit basket when upside-down. (B) Depending on the orientation, this image appears either as a duck or a rabbit.
  • Figure 2: Through simple tile permutations, a source image can be converted to a new image of any subject matter. Both the permutation and the content are learned simultaneously; the images created are suited to the tiles available for the composition. Examples with three famous paintings are shown. Each is converted into 3 different subjects (two results are shown for each). The number of tiles that the source is divided into is $64\times64$, $32\times32$ and $16\times16$ (top to bottom). With our approach, the more tiles there are, the easier it is for our system to produce compelling results; the opposite is true for state-of-the-art alternative systems.
  • Figure 3: A description of a Visual Anagrams step, adapted from geng2023visualanagrams, using $N=2$. A single image that appears as a bowl of fruit can be subdivided into $4\times4$ square tiles and rearranged into a "smiley face" emoji. To create this, two transforms of the same image (here $\psi^1$ is the identity and $\psi^2$ is a permutation of $4\times4$ tiles) are denoised simultaneously. Two diffusion processes use different prompts to create their per-pixel classifier-free guidance (CFG), represented here as an image. The CFGs are passed through their respective inverse transforms into a shared space and averaged before being applied to input $x_t$. Both images contribute to the guidance and remain synchronized.
  • Figure 4: In each pair, each image is $192\times192$ pixels. Each image can be subdivided into a grid of $8\times8$ blocks (64 in total), each $24\times24$ pixels. For each pair of images, there is a permutation of the 64 blocks that transforms the left image into the right one and vice versa. As in geng2023visualanagrams, the permutation is pre-specified and chosen randomly. To show geng2023visualanagrams in the best light, the pairs shown are the top-performing ones, as judged by their CLIP scores (distance to prompt) radford2021learning. All prompts were preceded by "A painting of".
  • Figure 5: Incorporating dynamic matching between diffusion steps. In the green rectangle, the input image is permuted and the rollout occurs. In the orange rectangle, the images created at the end of the rollout are used to compute the new permutations through dynamic matching.
  • ...and 12 more figures
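
The tile permutations in the Figure 3 and Figure 4 captions, and the step of passing guidance through inverse transforms before averaging, can be illustrated with a small sketch. This is not the authors' implementation: it uses plain nested lists in place of image tensors, fixed (not learned) permutations, and hypothetical helper names (`split_tiles`, `join_tiles`, `permute`, `invert`, `averaged_guidance`) introduced only for illustration. The diffusion model and the dynamic matching of Figure 5 are not modeled here.

```python
import random


def split_tiles(image, grid):
    """Split a square image (list of rows) into grid*grid tiles, row-major."""
    tile = len(image) // grid
    return [
        [row[tx * tile:(tx + 1) * tile] for row in image[ty * tile:(ty + 1) * tile]]
        for ty in range(grid)
        for tx in range(grid)
    ]


def join_tiles(tiles, grid):
    """Reassemble row-major tiles into a single image."""
    tile = len(tiles[0])
    image = []
    for ty in range(grid):
        for r in range(tile):
            row = []
            for tx in range(grid):
                row.extend(tiles[ty * grid + tx][r])
            image.append(row)
    return image


def permute(image, perm, grid):
    """Rearrange tiles: output tile i is input tile perm[i]."""
    tiles = split_tiles(image, grid)
    return join_tiles([tiles[p] for p in perm], grid)


def invert(perm):
    """Return the inverse permutation, so permute(permute(x, p), invert(p)) == x."""
    inv = [0] * len(perm)
    for i, p in enumerate(perm):
        inv[p] = i
    return inv


def averaged_guidance(estimates, perms, grid):
    """Map each view's per-pixel estimate back to the shared space and average.

    estimates[k] lives in the space of view k = permute(x, perms[k]); this
    stands in for the per-view guidance described in the Figure 3 caption.
    """
    aligned = [permute(e, invert(p), grid) for e, p in zip(estimates, perms)]
    n, size = len(aligned), len(aligned[0])
    return [
        [sum(a[y][x] for a in aligned) / n for x in range(size)]
        for y in range(size)
    ]


# Illustrative use: an 8x8 "image" cut into a 4x4 grid of 2x2 tiles.
image = [[y * 8 + x for x in range(8)] for y in range(8)]
perm = list(range(16))
random.Random(0).shuffle(perm)      # a fixed, randomly chosen permutation
view = permute(image, perm, 4)      # the rearranged view of the same pixels
restored = permute(view, invert(perm), 4)
```

Because a permutation only moves tiles, both views contain exactly the same pixels. Averaging in the shared space is what keeps the two views synchronized in this sketch.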