Move Anything with Layered Scene Diffusion
Jiawei Ren, Mengmeng Xu, Jui-Chieh Wu, Ziwei Liu, Tao Xiang, Antoine Toisoul
TL;DR
SceneDiffusion is a training-free, diffusion-based framework for controllable scene generation that optimizes a layered scene representation during sampling. By denoising renderings of multiple randomly sampled layouts in parallel and solving a closed-form update for the layers, the method disentangles layout from appearance, which enables moving, resizing, and cloning objects, as well as editing in-the-wild images by conditioning on a reference image. A short neural rendering phase then produces the final image from the optimized layers, balancing fidelity to the layer masks against image quality. Quantitative and qualitative results show state-of-the-art performance on scene generation and editing tasks. Edits respond in less than a second, and the method is compatible with general text-to-image diffusion models.
Abstract
Diffusion models generate images with an unprecedented level of quality, but how can we freely rearrange image layouts? Recent works generate controllable scenes via learning spatially disentangled latent codes, but these methods do not apply to diffusion models due to their fixed forward process. In this work, we propose SceneDiffusion to optimize a layered scene representation during the diffusion sampling process. Our key insight is that spatial disentanglement can be obtained by jointly denoising scene renderings at different spatial layouts. Our generated scenes support a wide range of spatial editing operations, such as moving, resizing, and cloning, as well as layer-wise appearance editing operations, such as object restyling and replacing. Moreover, a scene can be generated conditioned on a reference image, thus enabling object moving for in-the-wild images. Notably, this approach is training-free, compatible with general text-to-image diffusion models, and responsive in less than a second.
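To make the idea of jointly denoising several layouts more concrete, the following is a minimal NumPy sketch of one possible reading, not the authors' implementation. It assumes binary layer masks, integer offsets applied as cyclic shifts, front-to-back compositing, and a per-layer weighted average as the closed-form (least-squares) update. The names `shift`, `render`, `update_layers`, and the `denoise` callable are hypothetical and stand in for a real diffusion denoising step.

```python
import numpy as np

def shift(x, dy, dx):
    """Cyclically shift an (H, W) or (H, W, C) array by (dy, dx). Assumption: wrap-around."""
    return np.roll(x, shift=(dy, dx), axis=(0, 1))

def render(features, masks, offsets):
    """Composite layers front-to-back (index 0 is frontmost).

    Returns the rendered view and each layer's visibility map in that view.
    """
    H, W, C = features[0].shape
    image = np.zeros((H, W, C))
    free = np.ones((H, W))  # part of each pixel not yet covered by a nearer layer
    alphas = []
    for f, m, (dy, dx) in zip(features, masks, offsets):
        a = shift(m, dy, dx) * free          # visible part of this layer
        image += a[..., None] * shift(f, dy, dx)
        free = free - a
        alphas.append(a)
    return image, alphas

def update_layers(features, masks, layouts, denoise, eps=1e-8):
    """One joint step: render every layout, denoise it, average back per layer.

    The visibility-weighted average is the least-squares solution for each
    layer pixel given all denoised views (an assumed form of the update).
    """
    num = [np.zeros_like(f) for f in features]
    den = [np.zeros(f.shape[:2]) for f in features]
    for offsets in layouts:
        view, alphas = render(features, masks, offsets)
        view = denoise(view)                 # placeholder for a diffusion step
        for k, (a, (dy, dx)) in enumerate(zip(alphas, offsets)):
            # Move the visible part of the view back into layer coordinates.
            num[k] += shift(a[..., None] * view, -dy, -dx)
            den[k] += shift(a, -dy, -dx)
    # Pixels never visible in any layout keep their previous value.
    return [
        np.where(d[..., None] > eps, n / np.maximum(d, eps)[..., None], f)
        for n, d, f in zip(num, den, features)
    ]
```

In this sketch, a scene with one object and a background would pass two masks (the last one all ones), a list of layouts where each layout holds one `(dy, dx)` offset per layer, and a denoiser. Because every layer is updated from all layouts at once, the object's appearance is tied to its layer rather than to a position in the image, which is the intuition behind the spatial disentanglement described above.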
