Guided Image Synthesis via Initial Image Editing in Diffusion Model

Jiafeng Mao, Xueting Wang, Kiyoharu Aizawa

TL;DR

This work shows that blocks of pixels in the initial latent images of a diffusion model have a preference for generating specific content, and that modifying these blocks can significantly influence the generated image. It also finds that these generation preferences are determined primarily by the blocks' values rather than by their positions.

Abstract

Diffusion models have the ability to generate high-quality images by denoising pure Gaussian noise images. While previous research has primarily focused on improving the control of image generation through adjusting the denoising process, we propose a novel direction of manipulating the initial noise to control the generated image. Through experiments on Stable Diffusion, we show that blocks of pixels in the initial latent images have a preference for generating specific content, and that modifying these blocks can significantly influence the generated image. In particular, we show that modifying a part of the initial image affects the corresponding region of the generated image while leaving other regions unaffected, which is useful for repainting tasks. Furthermore, we find that the generation preferences of pixel blocks are primarily determined by their values, rather than their position. By moving pixel blocks with a tendency to generate user-desired content to user-specified regions, our approach achieves state-of-the-art performance in layout-to-image generation. Our results highlight the flexibility and power of initial image manipulation in controlling the generated image. Project Page: https://ut-mao.github.io/swap.github.io/
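The partial re-sampling idea in the abstract (modify one part of the initial image, leave the rest untouched) can be sketched in a few lines. This is only an illustrative sketch, not the authors' implementation: the function name `resample_region`, the box format, and the 8x image-to-latent scale (typical of Stable Diffusion v1, where a 512x512 image corresponds to a 4x64x64 latent) are assumptions made here.

```python
import numpy as np

# Assumption: latents are 1/8 of the image resolution (Stable Diffusion v1 style).
LATENT_SCALE = 8


def resample_region(latent, box, rng, scale=LATENT_SCALE):
    """Return a copy of `latent` with fresh Gaussian noise inside `box`.

    latent: array of shape (C, H, W), the initial noise.
    box:    (x0, y0, x1, y1) in image pixels, upper bounds exclusive.
    """
    c, h, w = latent.shape
    x0, y0, x1, y1 = box
    # Map image coordinates to latent indices, rounding outward and clamping.
    lx0, ly0 = max(0, x0 // scale), max(0, y0 // scale)
    lx1 = min(w, -(-x1 // scale))  # ceiling division
    ly1 = min(h, -(-y1 // scale))
    out = latent.copy()
    # Only the selected block gets new values; everything else is kept.
    out[:, ly0:ly1, lx0:lx1] = rng.standard_normal((c, ly1 - ly0, lx1 - lx0))
    return out


rng = np.random.default_rng(0)
z = rng.standard_normal((4, 64, 64))  # initial latent for a 512x512 image
z_new = resample_region(z, (128, 256, 256, 384), rng)
```

The resulting `z_new` would then be passed to the usual denoising loop with the same prompt. The captions below report that several attempts may be needed, so in practice this call would sit inside a retry loop with different random draws.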

Paper Structure

This paper contains 16 sections, 4 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Our analysis experiments on the generation tendency of random initial noise images. We present generation results from two different initial images, both randomly sampled from a Gaussian distribution. Images with the same border color are generated from the same initial image. When the initial image is the same, objects of the same category generated under different prompts are highly similar in position and visual appearance. When the randomly sampled initial images differ, the generated results differ greatly in layout and details, even when the same prompt is used.
  • Figure 2: Our experiments on initial noise image editing. We first generate an image from a random initial noise image sampled from a Gaussian distribution. Although both the 'cat' and the 'chair' appear in the first generated image, the location of the cat does not satisfy the prompt. Because the region that tends to generate 'cat' (bounding box A) is not where the cat is expected, we re-randomize the pixel values of the corresponding area in the initial image. We also re-randomize the region near the chair (bounding box B) because it did not tend to generate 'cat'. The improved image on the right is generated from the modified initial image with the same prompt after a few attempts.
  • Figure 3: Samples of re-painting experiments. Images on the left side are generated from the random initial noise, while images on the right side are generated from partially re-sampled initial noise. The first prompt involves two objects, but three were generated. After re-sampling the pixel values in the region corresponding to the third object in the initial image, the third object was removed. The second generation failed to generate the billboard's content (which should be a sheep). After re-sampling the pixel values in the region corresponding to the billboard, we obtained images with the same layout, style, and correctly generated billboard. The horse in the third generated image has three forelegs and unnatural hind legs. After re-sampling these two parts, we obtained generated images with different and better horses while keeping the overall image unchanged. It takes 4-5 attempts on average to obtain each modified sample.
  • Figure 4: Concept of the pixel-block swapping experiment. We use the attention map of the initial noise image to indicate its initial generation tendency. We then move the pixel blocks that tend to generate specific content into the specified regions and denoise the modified noise image as usual.
  • Figure 5: Attention maps after swapping, and the attention map computed from the modified initial noise $z^\prime$. The aggregated attention shows the result of applying the swap to the original attention map. $z^\prime$ is obtained by swapping pixel blocks with high attention values into the specified regions. The attention map of $z^\prime$ shows that the regions where high-attention pixel blocks are aggregated have a strong tendency to generate the corresponding content.
  • ...and 1 more figure
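The swapping step described in the captions for Figures 4 and 5 can be illustrated with a small NumPy sketch. It rests on several assumptions that are not taken from the paper: a "pixel block" is treated as a single latent position across all channels, the attention map is simply passed in as an array (in practice it would come from the model's cross-attention for one prompt token), and the function name `swap_into_region` is hypothetical.

```python
import numpy as np


def swap_into_region(latent, attention, region):
    """Swap the highest-attention latent positions into `region`.

    latent:    (C, H, W) initial noise.
    attention: (H, W) attention map for one prompt token.
    region:    (H, W) boolean mask of the user-specified target area.
    """
    c, h, w = latent.shape
    flat = latent.reshape(c, h * w).copy()
    region_idx = np.flatnonzero(region.ravel())
    k = region_idx.size
    # Indices of the k positions with the largest attention values.
    top_idx = np.argsort(attention.ravel())[::-1][:k]
    # High-attention positions already inside the region stay where they are.
    src = np.setdiff1d(top_idx, region_idx)  # high attention, outside region
    dst = np.setdiff1d(region_idx, top_idx)  # inside region, low attention
    # src and dst have equal length and are disjoint, so this is a pure swap.
    moved_in, moved_out = flat[:, src].copy(), flat[:, dst].copy()
    flat[:, dst], flat[:, src] = moved_in, moved_out
    return flat.reshape(c, h, w)


rng = np.random.default_rng(0)
z = rng.standard_normal((4, 8, 8))
attn = np.zeros((8, 8))
attn[0:2, 0:2] = 1.0             # toy map: high attention in the top-left corner
target = np.zeros((8, 8), dtype=bool)
target[6:8, 6:8] = True          # desired location: bottom-right corner
z_prime = swap_into_region(z, attn, target)
```

Because the operation only permutes positions, `z_prime` contains exactly the same values as `z`, which keeps the noise statistics unchanged. That is a property of this sketch; whether the original method preserves it in the same way is not stated in the text above.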