LayoutDiffusion: Controllable Diffusion Model for Layout-to-image Generation
Guangcong Zheng, Xianpan Zhou, Xuewei Li, Zhongang Qi, Ying Shan, Xi Li
TL;DR
LayoutDiffusion is a one-stage layout-conditioned diffusion model that treats structural image patches as a special layout, so that image and layout information can be fused in a unified form. It introduces a Layout Fusion Module (LFM) to capture inter-object relationships and an Object-aware Cross Attention (OaCA) mechanism for precise local conditioning in a unified coordinate space. The approach achieves state-of-the-art results on COCO-Stuff and Visual Genome across FID, CAS, and YOLO-based metrics, while enabling interactive layout edits and faster conditional sampling via a DPM-Solver-based pipeline. By turning multimodal fusion into unified-space fusion and using classifier-free guidance, LayoutDiffusion offers strong controllability and high-quality generation for complex multi-object scenes, with practical applications in scene synthesis and content creation.
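As a rough illustration of the classifier-free guidance mentioned above, the sketch below blends a layout-conditioned noise prediction with an unconditional one. It assumes the common formulation in which the guided prediction is the unconditional prediction plus a scaled difference; the function name, argument names, and the use of plain Python lists in place of tensors are hypothetical choices for this example, not taken from the LayoutDiffusion code.

```python
from typing import List, Sequence


def classifier_free_guidance(
    eps_cond: Sequence[float],
    eps_uncond: Sequence[float],
    scale: float,
) -> List[float]:
    """Blend conditional and unconditional noise predictions.

    Assumed formulation: eps = eps_uncond + scale * (eps_cond - eps_uncond).
    scale = 0 returns the unconditional prediction, scale = 1 returns the
    conditional one, and scale > 1 extrapolates towards the condition.
    """
    if len(eps_cond) != len(eps_uncond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


# Toy values standing in for per-pixel noise predictions.
guided = classifier_free_guidance([1.0, 2.0], [0.5, 1.0], scale=2.0)
print(guided)
```

In a real sampler the two predictions would typically come from the same network, evaluated once with the layout and once with an empty or padded layout; that detail is an assumption here as well.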
Abstract
Recently, diffusion models have achieved great success in image synthesis. However, in layout-to-image generation, where an image often contains a complex scene with multiple objects, exerting strong control over both the global layout map and each individual object remains challenging. In this paper, we propose a diffusion model named LayoutDiffusion that achieves higher generation quality and greater controllability than previous works. To overcome the difficulty of multimodal fusion of image and layout, we construct structural image patches with region information and transform the patched image into a special layout that is fused with the normal layout in a unified form. Moreover, we propose a Layout Fusion Module (LFM) and Object-aware Cross Attention (OaCA) to model the relationships among multiple objects; both are designed to be object-aware and position-sensitive, allowing precise control of spatially related information. Extensive experiments show that LayoutDiffusion outperforms previous SOTA methods on FID and CAS by relative margins of 46.35% and 26.70% on COCO-Stuff and 44.29% and 41.82% on VG. Code is available at https://github.com/ZGCTroy/LayoutDiffusion.
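One way to picture the "structural image patch with region information" is to give every patch of a feature grid its own bounding box in the same normalized coordinate space used for object boxes. The sketch below is a minimal illustration under that assumption; the `Box` alias, the function name `patch_boxes`, and the (x0, y0, x1, y1) convention are hypothetical and not drawn from the released code.

```python
from typing import List, Tuple

# (x0, y0, x1, y1), each coordinate normalized to [0, 1]; assumed convention.
Box = Tuple[float, float, float, float]


def patch_boxes(grid_h: int, grid_w: int) -> List[Box]:
    """Return one bounding box per patch of a grid_h x grid_w grid.

    Patches are listed row by row, so each patch carries its own region
    and can be handled like an entry of a layout.
    """
    if grid_h <= 0 or grid_w <= 0:
        raise ValueError("grid dimensions must be positive")
    boxes: List[Box] = []
    for row in range(grid_h):
        for col in range(grid_w):
            boxes.append(
                (col / grid_w, row / grid_h, (col + 1) / grid_w, (row + 1) / grid_h)
            )
    return boxes


# A 2 x 2 grid yields four quadrant boxes.
for box in patch_boxes(2, 2):
    print(box)
```

Because patch boxes and object boxes then share one coordinate space, position-sensitive attention between them can be expressed over a single kind of input, which is the intuition behind fusing the two "in a unified form".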
