CreatiLayout: Siamese Multimodal Diffusion Transformer for Creative Layout-to-Image Generation

Hui Zhang, Dexiang Hong, Yitong Wang, Jie Shao, Xinglong Wu, Zuxuan Wu, Yu-Gang Jiang

TL;DR

CreatiLayout introduces SiamLayout, a Siamese multimodal diffusion Transformer that treats layout as an independent modality for layout-to-image generation. By decoupling image-text and image-layout interactions, SiamLayout mitigates modality competition and leverages MM-DiT to produce high-quality, fine-grained attribute rendering guided by layout. The work provides LayoutSAM, a large-scale open-set layout dataset, and LayoutDesigner, an LLM-based planner that converts diverse user inputs into harmonious layouts, enabling end-to-end creative layout-to-image generation. Across LayoutSAM-Eval, COCO, and T2I benchmarks, SiamLayout achieves state-of-the-art performance in spatial adherence and attribute fidelity, demonstrating the practical impact of integrating layout with MM-DiTs and LLM-driven planning. The framework offers a scalable path for advanced content creation with precise spatial control and complex descriptive attributes.

Abstract

Diffusion models have been recognized for their ability to generate images that are not only visually appealing but also of high artistic quality. As a result, Layout-to-Image (L2I) generation has been proposed to leverage region-specific positions and descriptions to enable more precise and controllable generation. However, previous methods primarily focus on UNet-based models (e.g., SD1.5 and SDXL), and limited effort has explored Multimodal Diffusion Transformers (MM-DiTs), which have demonstrated powerful image generation capabilities. Enabling MM-DiT for layout-to-image generation seems straightforward but is challenging due to the complexity of how layout is introduced, integrated, and balanced among multiple modalities. To this end, we explore various network variants to efficiently incorporate layout guidance into MM-DiT, and ultimately present SiamLayout. To inherit the advantages of MM-DiT, we use a separate set of network weights to process the layout, treating it as equally important as the image and text modalities. Meanwhile, to alleviate the competition among modalities, we decouple the image-layout interaction into a siamese branch alongside the image-text one and fuse them at a later stage. Moreover, we contribute a large-scale layout dataset, named LayoutSAM, which includes 2.7 million image-text pairs and 10.7 million entities. Each entity is annotated with a bounding box and a detailed description. We further construct the LayoutSAM-Eval benchmark as a comprehensive tool for evaluating L2I generation quality. Finally, we introduce LayoutDesigner, which taps into the potential of large language models in layout planning, transforming them into experts in layout generation and optimization. These components form CreatiLayout, a systematic solution that integrates the layout model, dataset, and planner for creative layout-to-image generation.
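To make the dataset description concrete, the following is a minimal sketch of how one layout sample (a global caption plus entities, each with a bounding box and a description) might be represented. The class names `Entity` and `LayoutSample`, the `to_pixels` helper, and the normalized `(x0, y0, x1, y1)` box convention are assumptions made for illustration; they are not the released LayoutSAM schema.

```python
from dataclasses import dataclass
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), assumed normalized to [0, 1]


@dataclass(frozen=True)
class Entity:
    """One annotated region: a detailed description plus its bounding box."""
    description: str
    box: Box

    def __post_init__(self) -> None:
        x0, y0, x1, y1 = self.box
        # Reject boxes that are empty, inverted, or outside the unit square.
        if not (0.0 <= x0 < x1 <= 1.0 and 0.0 <= y0 < y1 <= 1.0):
            raise ValueError(f"invalid box: {self.box}")


@dataclass(frozen=True)
class LayoutSample:
    """A global caption together with its region-level entities."""
    caption: str
    entities: Tuple[Entity, ...]

    def to_pixels(self, width: int, height: int) -> List[Tuple[str, Tuple[int, int, int, int]]]:
        """Convert the normalized boxes to pixel coordinates for a given image size."""
        result = []
        for e in self.entities:
            x0, y0, x1, y1 = e.box
            result.append((e.description,
                           (round(x0 * width), round(y0 * height),
                            round(x1 * width), round(y1 * height))))
        return result


sample = LayoutSample(
    caption="a cat sitting on a wooden bench",
    entities=(Entity("a fluffy orange cat", (0.25, 0.5, 0.75, 1.0)),),
)
```

Keeping boxes normalized is one possible design choice: the same layout can then be rendered at any resolution, and `sample.to_pixels(512, 512)` converts only at the point of use.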

Paper Structure

This paper contains 48 sections, 9 equations, 15 figures, 7 tables.

Figures (15)

  • Figure 1: We present a novel approach to empower MM-DiTs for layout-to-image generation, achieving high-quality and fine-grained controllable generation, as evidenced by the precise rendering of complex attributes (e.g. color, texture, shape, and text).
  • Figure 2: An overview of the proposed pipeline. Layout tokens are derived from the layout encoder based on spatial locations and region descriptions. SiamLayout employs separate transformer parameters to process the layout, treating it as an equally important modality as the image and text. Layout and text guide the image independently through siamese branches, and are then fused in the later stage. We experiment with two additional network variants that incorporate layout via cross-attention and $\mathrm{M}^3$-Attention. SiamLayout works best.
  • Figure 3: Attention maps of $\mathrm{M}^3$-Attention and SiamLayout. (a) The layout's influence on image generation is much lower compared to the text due to lower similarity. (b) SiamLayout decouples (a) into two independent MM-Attentions for image-text and image-layout, enabling equal guidance from both layout and text.
  • Figure 4: An overview of the automatic annotation pipeline.
  • Figure 5: An overview of layout generation and optimization.
  • ...and 10 more figures
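
The decoupling described in the captions of Figures 2 and 3 can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the per-modality projection weights of a real MM-DiT are omitted, only the image tokens are updated, and the fusion rule (a weighted average controlled by the hypothetical parameter `layout_weight`) is an assumption, since the text says only that the branches are fused at a later stage.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors."""
    scale = 1.0 / math.sqrt(len(keys[0]))
    out = []
    for q in queries:
        scores = [scale * sum(a * b for a, b in zip(q, k)) for k in keys]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, values))
                    for d in range(len(values[0]))])
    return out


def joint_attention(image, other):
    """Image tokens attend over the concatenation [image; other].

    Simplification: learned Q/K/V projections are omitted and only the
    updated image tokens are returned.
    """
    tokens = image + other
    return attention(image, tokens, tokens)


def siamese_step(image, text, layout, layout_weight=0.5):
    """Run image-text and image-layout attention independently, then fuse.

    The weighted-average fusion is an assumption made for this sketch.
    """
    from_text = joint_attention(image, text)      # image-text branch
    from_layout = joint_attention(image, layout)  # image-layout (siamese) branch
    return [[(1 - layout_weight) * t + layout_weight * l
             for t, l in zip(t_row, l_row)]
            for t_row, l_row in zip(from_text, from_layout)]


image = [[1.0, 0.0], [0.0, 1.0]]   # two toy image tokens
text = [[0.5, 0.5]]                # one toy text token
layout = [[1.0, 1.0], [0.0, 0.0]]  # two toy layout tokens
fused = siamese_step(image, text, layout)
```

Because each branch applies its own softmax, layout tokens never compete with text tokens for attention mass within a single normalization, which is the intuition behind Figure 3(b). Setting `layout_weight=0.0` reduces the step to the image-text branch alone.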