Move Anything with Layered Scene Diffusion

Jiawei Ren, Mengmeng Xu, Jui-Chieh Wu, Ziwei Liu, Tao Xiang, Antoine Toisoul

TL;DR

SceneDiffusion presents a training-free, diffusion-based framework for controllable scene generation by optimizing a layered scene representation during sampling. By denoising multiple randomly sampled layouts in parallel and solving a closed-form update, the method achieves spatial disentanglement between layout and appearance, enabling moving, resizing, and cloning of objects, as well as in-the-wild editing guided by a reference image. A short neural rendering phase renders the final image from the optimized layers, balancing fidelity to masks with image quality. Quantitative and qualitative results show state-of-the-art performance on scene generation and editing tasks, with interactive speeds suitable for real-time editing and broad compatibility with standard diffusion models.

Abstract

Diffusion models generate images with an unprecedented level of quality, but how can we freely rearrange image layouts? Recent works generate controllable scenes via learning spatially disentangled latent codes, but these methods do not apply to diffusion models due to their fixed forward process. In this work, we propose SceneDiffusion to optimize a layered scene representation during the diffusion sampling process. Our key insight is that spatial disentanglement can be obtained by jointly denoising scene renderings at different spatial layouts. Our generated scenes support a wide range of spatial editing operations, including moving, resizing, cloning, and layer-wise appearance editing operations, including object restyling and replacing. Moreover, a scene can be generated conditioned on a reference image, thus enabling object moving for in-the-wild images. Notably, this approach is training-free, compatible with general text-to-image diffusion models, and responsive in less than a second.
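
To make the idea of a layered scene concrete, the following is a minimal sketch of how a layout could be rendered from layers. It is an illustration under stated assumptions, not the authors' implementation: each layer is assumed to hold a feature map and a binary mask, layers are assumed to be ordered front to back with a full-frame background last, a layout is assumed to be one integer (dy, dx) offset per layer, and the names `shift` and `render` are hypothetical. Resizing is omitted for brevity.

```python
import numpy as np

def shift(arr, dy, dx):
    """Translate an (H, W) or (H, W, C) array by (dy, dx), zero-filling the border."""
    out = np.zeros_like(arr)
    h, w = arr.shape[:2]
    ys, ye = max(dy, 0), min(h + dy, h)
    xs, xe = max(dx, 0), min(w + dx, w)
    if ys < ye and xs < xe:
        out[ys:ye, xs:xe] = arr[ys - dy:ye - dy, xs - dx:xe - dx]
    return out

def render(features, masks, offsets):
    """Composite layers front to back (assumed ordering) at the given layout."""
    h, w = masks[0].shape
    visible = np.ones((h, w))                      # part of each pixel not yet covered
    view = np.zeros_like(features[0], dtype=float)
    for f, m, (dy, dx) in zip(features, masks, offsets):
        moved_mask = shift(m, dy, dx)
        alpha = moved_mask * visible               # this layer's visible contribution
        view += alpha[..., None] * shift(f, dy, dx)
        visible = visible * (1.0 - moved_mask)
    return view

# Toy scene: one 2x2 object in the top-left corner over a uniform background.
H = W = 4
obj_f = np.full((H, W, 1), 1.0)
obj_m = np.zeros((H, W))
obj_m[0:2, 0:2] = 1.0
bg_f = np.full((H, W, 1), 0.2)
bg_m = np.ones((H, W))

original = render([obj_f, bg_f], [obj_m, bg_m], [(0, 0), (0, 0)])
moved = render([obj_f, bg_f], [obj_m, bg_m], [(2, 2), (0, 0)])  # object moved to bottom-right
```

In this sketch, moving an object only changes its offset, and cloning amounts to listing the same feature map and mask twice with different offsets. Because the background layer stores content behind the object, the vacated region is filled without regenerating the scene.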

Paper Structure

This paper contains 55 sections, 12 equations, 16 figures, and 8 tables.

Figures (16)

  • Figure 1: Move anything on an image. Top: our approach generates playable scenes: objects are spatially disentangled and can therefore be freely moved, resized, and cloned in the scene. Bottom: a scene can be generated conditioned on a reference image, thus supporting extensive spatial image editing operations. Our approach is training-free and compatible with general text-to-image diffusion models. Once optimized, rendering a new layout requires less than a second on a single GPU, allowing interactive editing.
  • Figure 2: Method overview. Our framework has two stages: i) an optimization stage, in which we optimize a layered scene representation with SceneDiffusion for $T-\tau$ diffusion steps, and ii) an inference stage, in which we render the optimized layered scene with $\tau$-step standard image diffusion. iii) SceneDiffusion updates the layered scene by denoising multiple randomly sampled layouts in parallel. In the illustration, the scene has 4 layers. Each layer consists of a feature map $f$, a mask $m$ (shown as a box), and a text prompt $y$ (shown at the bottom). At denoising step $t$, we randomly sample $N$ layouts and render them to get different views $v^{(t)}$. We then denoise the views for one step using a pretrained T2I diffusion model to get $\hat{v}^{(t-1)}$, which are used to update the feature maps $f^{(t)} \to f^{(t-1)}$ in the layered scene. Note that the boxes here serve only as rough object geometry (like blobs in epstein2022blobgan) and can be replaced by more accurate masks.
  • Figure 3: Sequential manipulations. Our generated scenes can be manipulated by operating on layers sequentially.
  • Figure 4: Object moving. Our approach can be employed to move objects on a given image. Edited objects are shown in bold in the prompts. Examples are borrowed from epstein2023diffusion, and no access to the initial latent noise is assumed. All layouts for each example are generated from the same scene. As a result, our approach keeps the overall content consistent across different edits, which most prior works fail to achieve. A full comparison with prior works can be found in the appendix.
  • Figure 5: Restyling objects. Adding a style description to the layer prompt restyles the object when the initial noise is kept fixed. The circular arrow indicates the restyled object.
  • ...and 11 more figures
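
The update described in the Figure 2 caption (render several layouts, denoise each view for one step, then write the results back into the feature maps) can be sketched as follows. This is a toy illustration under stated assumptions rather than the paper's implementation: the update is assumed to be a visibility-weighted average of the denoised views mapped back into each layer's own coordinates, layers are assumed to be ordered front to back with the background last, layouts are integer offsets only, and `denoise` is a placeholder for one step of a pretrained diffusion model. The names `shift`, `render_with_alphas`, `sample_layouts`, and `scene_update` are hypothetical.

```python
import numpy as np

def shift(arr, dy, dx):
    """Translate an (H, W) or (H, W, C) array by (dy, dx), zero-filling the border."""
    out = np.zeros_like(arr)
    h, w = arr.shape[:2]
    ys, ye = max(dy, 0), min(h + dy, h)
    xs, xe = max(dx, 0), min(w + dx, w)
    if ys < ye and xs < xe:
        out[ys:ye, xs:xe] = arr[ys - dy:ye - dy, xs - dx:xe - dx]
    return out

def render_with_alphas(features, masks, offsets):
    """Composite layers front to back; also return each layer's visibility map."""
    visible = np.ones(masks[0].shape)
    view = np.zeros_like(features[0], dtype=float)
    alphas = []
    for f, m, (dy, dx) in zip(features, masks, offsets):
        moved_mask = shift(m, dy, dx)
        alpha = moved_mask * visible
        view += alpha[..., None] * shift(f, dy, dx)
        visible = visible * (1.0 - moved_mask)
        alphas.append(alpha)
    return view, alphas

def sample_layouts(rng, n, num_layers, max_shift):
    """Sample n layouts: random offsets for objects, background (last layer) fixed."""
    layouts = []
    for _ in range(n):
        offs = [tuple(int(s) for s in rng.integers(-max_shift, max_shift + 1, size=2))
                for _ in range(num_layers - 1)]
        layouts.append(offs + [(0, 0)])
    return layouts

def scene_update(features, masks, layouts, denoise, eps=1e-8):
    """One joint update: average denoised views back into every layer's feature map."""
    num = [np.zeros_like(f, dtype=float) for f in features]
    den = [np.zeros(m.shape) for m in masks]
    for offsets in layouts:
        view, alphas = render_with_alphas(features, masks, offsets)
        v_hat = denoise(view)                      # stand-in for one diffusion step
        for k, (dy, dx) in enumerate(offsets):
            # Map the visible part of the denoised view back to layer coordinates.
            num[k] += shift(alphas[k][..., None] * v_hat, -dy, -dx)
            den[k] += shift(alphas[k], -dy, -dx)
    updated = []
    for f, n_k, d_k in zip(features, num, den):
        seen = d_k > eps                           # pixels visible in at least one layout
        new_f = f.astype(float)                    # astype returns a copy
        new_f[seen] = n_k[seen] / d_k[seen][:, None]
        updated.append(new_f)
    return updated

# Toy scene: a 2x2 object over a uniform background, with a dummy "denoiser".
H = W = 4
obj_f = np.full((H, W, 1), 1.0)
obj_m = np.zeros((H, W))
obj_m[0:2, 0:2] = 1.0
bg_f = np.full((H, W, 1), 0.2)
bg_m = np.ones((H, W))

fixed_layouts = [[(0, 0), (0, 0)], [(1, 1), (0, 0)]]
new_obj, new_bg = scene_update([obj_f, bg_f], [obj_m, bg_m], fixed_layouts,
                               denoise=lambda v: 0.5 * v)
```

With the two fixed layouts in this toy example, background pixels hidden by the object in one layout are still updated from the other layout, while a pixel hidden in both keeps its previous value. This suggests why sampling several layouts matters in such a scheme: content behind an object only receives updates from layouts in which the object has moved away.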