Factorized Diffusion: Perceptual Illusions by Noise Decomposition

Daniel Geng, Inbum Park, Andrew Owens

TL;DR

This work introduces Factorized Diffusion, a zero-shot framework that controls individual components of an image during diffusion sampling by conditioning each component of a decomposition on separate prompts. It enables a family of perceptual illusions, including hybrid images that change with viewing distance, color hybrids that differ in grayscale versus color, and motion hybrids that respond to blur, by constructing a composite noise estimate from per-component noise estimates. The method covers multiple decompositions (frequency subbands, color spaces, motion, spatial masks, and scaling) and provides a theoretical view of why per-component conditioning preserves independence under linear updates, with extensions to inverse problems that generate hybrids from real images. Empirically, it yields high-quality hybrids and outperforms some prior approaches in realism and prompt alignment, while also addressing limitations and societal considerations of perceptual deception.
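One way to read the "linear updates" remark is sketched below. The symbols $\epsilon_\theta$ (the pretrained noise predictor), $y_i$ (the prompt for component $i$), and $\omega_t$, $\gamma_t$ (sampler coefficients) are notation assumed here for illustration. The sketch also assumes each $f_i$ is linear and omits any stochastic term of the sampler.

$$
x = \sum_i f_i(x), \qquad \epsilon_i = \epsilon_\theta(x_t, y_i, t), \qquad \tilde{\epsilon} = \sum_i f_i(\epsilon_i)
$$

$$
x_{t-1} = \omega_t x_t + \gamma_t \tilde{\epsilon} = \sum_i f_i(\omega_t x_t) + \sum_i f_i(\gamma_t \epsilon_i) = \sum_i f_i\left(\omega_t x_t + \gamma_t \epsilon_i\right)
$$

Under these assumptions the update splits into a sum whose $i$-th term depends only on the noise estimate for prompt $y_i$.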

Abstract

Given a factorization of an image into a sum of linear components, we present a zero-shot method to control each individual component through diffusion model sampling. For example, we can decompose an image into low and high spatial frequencies and condition these components on different text prompts. This produces hybrid images, which change appearance depending on viewing distance. By decomposing an image into three frequency subbands, we can generate hybrid images with three prompts. We also use a decomposition into grayscale and color components to produce images whose appearance changes when they are viewed in grayscale, a phenomenon that naturally occurs under dim lighting. And we explore a decomposition by a motion blur kernel, which produces images that change appearance under motion blurring. Our method works by denoising with a composite noise estimate, built from the components of noise estimates conditioned on different prompts. We also show that for certain decompositions, our method recovers prior approaches to compositional generation and spatial control. Finally, we show that we can extend our approach to generate hybrid images from real images. We do this by holding one component fixed and generating the remaining components, effectively solving an inverse problem.
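The composite noise estimate for the two-band (low/high frequency) case can be sketched as follows. This is a minimal illustration rather than the authors' implementation. It assumes the low-pass component is a Gaussian blur with standard deviation `sigma` and the high-pass component is the residual. The function names are hypothetical, and the two input arrays stand in for noise estimates that a diffusion model would produce for two different prompts.

```python
import numpy as np

def gaussian_low_pass(x, sigma):
    """Gaussian blur of an (H, W) or (H, W, C) array, applied in the Fourier domain.

    Assumption: circular (wrap-around) boundary handling, which the FFT implies.
    """
    h, w = x.shape[:2]
    fy = np.fft.fftfreq(h)[:, None]  # vertical frequencies, cycles per pixel
    fx = np.fft.fftfreq(w)[None, :]  # horizontal frequencies, cycles per pixel
    # Fourier transform of a Gaussian with standard deviation `sigma` pixels.
    transfer = np.exp(-2.0 * np.pi ** 2 * sigma ** 2 * (fx ** 2 + fy ** 2))
    if x.ndim == 3:
        transfer = transfer[:, :, None]
    spectrum = np.fft.fft2(x, axes=(0, 1))
    return np.real(np.fft.ifft2(spectrum * transfer, axes=(0, 1)))

def high_pass(x, sigma):
    """Residual component, so that low + high reconstructs x exactly."""
    return x - gaussian_low_pass(x, sigma)

def composite_noise(eps_low_prompt, eps_high_prompt, sigma):
    """Low frequencies from one noise estimate, high frequencies from the other."""
    return gaussian_low_pass(eps_low_prompt, sigma) + high_pass(eps_high_prompt, sigma)

# Stand-ins for two prompt-conditioned noise estimates (hypothetical data).
rng = np.random.default_rng(0)
eps_a = rng.standard_normal((16, 16, 3))
eps_b = rng.standard_normal((16, 16, 3))
eps_tilde = composite_noise(eps_a, eps_b, sigma=2.0)
```

In a sampler, `eps_tilde` would replace the single noise estimate at each denoising step. When both inputs are the same array, the composite reduces to that array, which is a quick sanity check on the decomposition.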

Paper Structure

This paper contains 42 sections, 16 equations, 20 figures, 3 tables.

Figures (20)

  • Figure 1: Illusions by Factorized Diffusion. By conditioning the components of a generated image on different prompts, we can use off-the-shelf text-conditioned image diffusion models to synthesize hybrid images (Oliva et al., 2006), hybrid images containing three objects, and new perceptual illusions that we refer to as color hybrids and motion hybrids, which change appearance when color is added or motion blur is induced. In addition, we can extract a component from an existing image and generate the missing components, allowing us to produce hybrid images from real images, which we term inverse hybrids. Examples shown are hand-picked. For random samples, please see \ref{fig:random} and \ref{fig:sup_random}. For the hybrid images, we include insets to aid in visualization. However, perception of this effect depends on the resolution of the images, so we highly encourage the reader to zoom so that an image fills the screen completely, or visit our website at https://dangeng.github.io/factorized_diffusion/ for easier viewing.
  • Figure 2: Factorized Diffusion. Given an image decomposition, we control components of the decomposition through text conditioning during image generation. To do this, we modify the sampling procedure of a pretrained diffusion model. Specifically, at each denoising step, $t$, we construct a new noise estimate, $\tilde{\epsilon}$, to use for denoising, whose components come from components of $\epsilon_i$, which are noise estimates conditioned on different prompts. Here, we show a decomposition into three frequency subbands, used for creating triple hybrid images, but we consider a number of other decompositions.
  • Figure 3: Effect of $\sigma$. We show a linear sweep over the $\sigma$ value used in our hybrid decomposition. A lower $\sigma$ makes the low-pass prompt more prominent, and a higher $\sigma$ makes the high-pass prompt more prominent. Hybrid images lie in between. Best viewed digitally, with zoom.
  • Figure 4: Comparison to Oliva et al. (2006). We take hybrid images from Oliva et al. (2006) and generate our own versions. Left: our method. Right: Oliva et al. Our method produces more realistic images while still containing both subjects. Best viewed digitally, with zoom.
  • Figure 5: Color Hybrids. We show additional color hybrid results. These images change appearance when color is added or removed, and therefore also when moved from bright to dim lighting, in which color is harder to see.
  • ...and 15 more figures
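
The grayscale/color decomposition behind the color hybrids in Figure 5 can be sketched in the same way. This is a minimal illustration, not the authors' code. It assumes the grayscale component is the per-pixel mean over the three color channels and the color component is the residual. The function names are hypothetical, and the arrays stand in for prompt-conditioned noise estimates.

```python
import numpy as np

def gray_component(x):
    """Per-pixel channel mean, repeated across channels so the shape matches x."""
    return np.repeat(x.mean(axis=-1, keepdims=True), x.shape[-1], axis=-1)

def color_component(x):
    """Residual after removing the grayscale component."""
    return x - gray_component(x)

def composite_noise_color(eps_gray_prompt, eps_color_prompt):
    """Grayscale part from one noise estimate, color part from the other."""
    return gray_component(eps_gray_prompt) + color_component(eps_color_prompt)

# Stand-ins for two prompt-conditioned noise estimates (hypothetical data).
rng = np.random.default_rng(1)
eps_gray = rng.standard_normal((8, 8, 3))
eps_color = rng.standard_normal((8, 8, 3))
eps_tilde_color = composite_noise_color(eps_gray, eps_color)
```

Because the color residual has zero channel mean, the grayscale view of `eps_tilde_color` matches the grayscale view of `eps_gray`, so under this decomposition only the first prompt influences what is seen once color is removed.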