
LayoutDiffusion: Controllable Diffusion Model for Layout-to-image Generation

Guangcong Zheng, Xianpan Zhou, Xuewei Li, Zhongang Qi, Ying Shan, Xi Li

TL;DR

LayoutDiffusion is a one-stage layout-conditioned diffusion model that treats structural image patches as a special kind of layout, so that image and layout information can be fused in a unified form. It introduces a Layout Fusion Module (LFM) to capture inter-object relationships and an Object-aware Cross Attention (OaCA) mechanism for precise local conditioning in a unified coordinate space. The approach achieves state-of-the-art results on COCO-Stuff and Visual Genome across FID, CAS, and YOLO-based metrics, while enabling interactive layout edits and faster conditional sampling via a DPM-Solver-based pipeline. By turning multimodal fusion into fusion within a unified space and using classifier-free guidance, LayoutDiffusion offers strong controllability and high-quality generation for complex multi-object scenes, with applications in scene synthesis and content creation.
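
The summary above mentions classifier-free guidance without spelling out the update. As a minimal sketch of the standard formulation (not taken from the paper's code), the guided noise prediction starts from an unconditional prediction and moves toward the layout-conditioned one. The function name, the list-of-floats representation, and the `scale` parameter are illustrative assumptions; how LayoutDiffusion builds its unconditional input is not stated in this summary.

```python
def classifier_free_guidance(eps_cond, eps_uncond, scale):
    """Blend conditional and unconditional noise predictions.

    eps_cond   : noise predicted with the layout condition (list of floats)
    eps_uncond : noise predicted without it (assumed: an empty or padded layout)
    scale      : hypothetical guidance scale; 0 ignores the layout,
                 1 returns the conditional prediction, >1 extrapolates past it
    """
    if len(eps_cond) != len(eps_uncond):
        raise ValueError("predictions must have the same length")
    return [u + scale * (c - u) for c, u in zip(eps_cond, eps_uncond)]


if __name__ == "__main__":
    cond = [1.0, 2.0]
    uncond = [0.5, 1.0]
    for s in (0.0, 1.0, 2.0):
        print(s, classifier_free_guidance(cond, uncond, s))
```

Larger scales typically trade sample diversity for closer adherence to the condition, which is why the scale is usually exposed as a sampling-time parameter.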

Abstract

Recently, diffusion models have achieved great success in image synthesis. However, in layout-to-image generation, where an image often contains a complex scene with multiple objects, achieving strong control over both the global layout map and each individual object remains challenging. In this paper, we propose a diffusion model named LayoutDiffusion that achieves higher generation quality and greater controllability than previous works. To overcome the difficult multimodal fusion of image and layout, we propose to construct a structural image patch with region information and transform the patched image into a special layout that is fused with the normal layout in a unified form. Moreover, the Layout Fusion Module (LFM) and Object-aware Cross Attention (OaCA) are proposed to model the relationships among multiple objects; they are designed to be object-aware and position-sensitive, allowing precise control over spatially related information. Extensive experiments show that LayoutDiffusion outperforms the previous SOTA methods on FID and CAS by relative margins of 46.35% and 26.70% on COCO-Stuff, and 44.29% and 41.82% on VG. Code is available at https://github.com/ZGCTroy/LayoutDiffusion.
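
To make the "structural image patch" idea concrete, the sketch below assigns each patch of a square image a bounding box in the same normalized coordinate space that object boxes use, so patches and objects can both be described as (box, content) pairs. This is only one plausible reading of the abstract: the function name `patch_boxes`, the square-image restriction, and the (x0, y0, x1, y1) convention are assumptions, not details confirmed by the text above.

```python
def patch_boxes(image_size, patch_size):
    """Return one normalized (x0, y0, x1, y1) box per patch, in row-major order.

    Assumes a square image whose side is an exact multiple of patch_size.
    Coordinates lie in [0, 1], matching how object boxes are commonly normalized.
    """
    if patch_size <= 0 or image_size % patch_size != 0:
        raise ValueError("image_size must be a positive multiple of patch_size")
    n = image_size // patch_size  # patches per side
    boxes = []
    for row in range(n):
        for col in range(n):
            boxes.append((col / n, row / n, (col + 1) / n, (row + 1) / n))
    return boxes


if __name__ == "__main__":
    # An 8x8 feature map split into 4x4 patches gives a 2x2 grid of boxes.
    for box in patch_boxes(8, 4):
        print(box)
```

Once every patch carries its own box, a patch and an object can be compared through their positions as well as their content, which is what fusing image and layout "in a unified form" suggests.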

Paper Structure (33 sections, 21 equations, 16 figures, 7 tables)

This paper contains 33 sections, 21 equations, 16 figures, 7 tables.

Figures (16)

  • Figure 1: Compared to text, the layout allows diffusion models to obtain more control over the objects while maintaining high quality. Unlike the prevailing methods, we propose a diffusion model named LayoutDiffusion for layout-to-image generation. We transform the difficult multimodal fusion of the image and layout into a unified form by constructing a structural image patch with region information and regarding the patched image as a special layout.
  • Figure 2: The whole pipeline of LayoutDiffusion. The layout, consisting of bounding boxes $b$ and object categories $c$, is transformed into the embeddings $B_{\mathcal{L}},C_{\mathcal{L}},L$. The Layout Fusion Module then fuses the layout embedding $L$ to output the fused layout embedding $L'$. Finally, the Image-Layout Fusion Module, which uses direct addition for global conditioning and Object-aware Cross Attention (OaCA) for local conditioning, fuses the layout-related $B_{\mathcal{L}},C_{\mathcal{L}},L'$ with the image feature $I$ at multiple resolutions.
  • Figure 3: Visual comparison with SOTA methods on COCO-Stuff 256$\times$256. LayoutDiffusion has better generation quality and stronger controllability than the other methods.
  • Figure 4: The diversity of LayoutDiffusion. The images in each row are generated from the same layout yet differ substantially.
  • Figure 5: The interactivity of LayoutDiffusion. We progressively add objects to the layout, and the newly generated objects are also of high quality.
  • ...and 11 more figures
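
The first two stages described in the Figure 2 caption (embedding the layout, then fusing it) can be sketched roughly as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it assumes $L$ is the sum of a linear box embedding and a class-table lookup, and it stands in for the Layout Fusion Module with a single-head self-attention that has no learned projections. All names (`embed_layout`, `fuse_layout`, `box_weight`, `class_table`) are hypothetical.

```python
import math


def embed_layout(boxes, classes, box_weight, class_table):
    """Assumed form L = B + C: linear box projection plus class embedding.

    boxes       : list of (x0, y0, x1, y1) tuples
    classes     : list of integer class ids
    box_weight  : list of `dim` weight vectors, each of length 4
    class_table : list mapping class id -> embedding of length `dim`
    """
    layout = []
    for box, cls in zip(boxes, classes):
        b_emb = [sum(x * w for x, w in zip(box, col)) for col in box_weight]
        c_emb = class_table[cls]
        layout.append([b + c for b, c in zip(b_emb, c_emb)])
    return layout


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_layout(layout):
    """Stand-in for layout fusion: each object attends to every object.

    Uses scaled dot-product attention with queries = keys = values = layout.
    """
    dim = len(layout[0])
    fused = []
    for query in layout:
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in layout]
        weights = softmax(scores)
        fused.append([sum(w * value[d] for w, value in zip(weights, layout))
                      for d in range(dim)])
    return fused


if __name__ == "__main__":
    boxes = [(0.0, 0.0, 0.5, 0.5), (0.5, 0.5, 1.0, 1.0)]
    classes = [0, 1]
    box_weight = [[1.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.0, 1.0]]  # dim = 2
    class_table = [[0.1, 0.2], [0.3, 0.4]]
    layout = embed_layout(boxes, classes, box_weight, class_table)
    print(fuse_layout(layout))
```

Because every fused vector is a weighted average over all objects, each output row mixes information from the whole layout, which is the property the fusion step is meant to provide before the layout is combined with image features.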