InstanceAssemble: Layout-Aware Image Generation via Instance Assembling Attention

Qiang Xiang, Shuang Sun, Binglei Li, Dejia Song, Huaxia Li, Nemo Chen, Xu Tang, Yao Hu, Junping Zhang

TL;DR

This work targets layout-controlled image generation by introducing InstanceAssemble, a diffusion-based framework that conditions on per-instance layouts via a Layout Encoder and an instance-wise Assemble-MMDiT. The method enables precise position control with bounding boxes and supports multimodal instance content, including textual and visual references, while using LoRA for lightweight adaptation that preserves base-model capabilities. It also contributes a dense-layout benchmark, DenseLayout, and the Layout Grounding Score (LGS) for interpretable evaluation of spatial and semantic alignment. Empirical results show state-of-the-art performance under complex layouts and robust cross-domain generalization, with notable gains when visual instance content is incorporated. Overall, the approach provides a scalable, flexible pathway for fine-grained, layout-aware image synthesis, with practical implications for design automation and multimodal generation tasks.

Abstract

Diffusion models have demonstrated remarkable capabilities in generating high-quality images. Recent advances in Layout-to-Image (L2I) generation have leveraged positional conditions and textual descriptions to facilitate precise and controllable image synthesis. Despite overall progress, current L2I methods still exhibit suboptimal performance. We therefore propose InstanceAssemble, a novel architecture that incorporates layout conditions via instance-assembling attention, enabling position control with bounding boxes (bbox) and multimodal content control, including text and additional visual content. Our method adapts flexibly to existing DiT-based T2I models through lightweight LoRA modules. Additionally, we propose DenseLayout, a comprehensive benchmark for layout-to-image generation containing 5k images with 90k instances in total. We further introduce the Layout Grounding Score (LGS), an interpretable evaluation metric that assesses the accuracy of L2I generation more precisely. Experiments demonstrate that InstanceAssemble achieves state-of-the-art performance under complex layout conditions while exhibiting strong compatibility with diverse style LoRA modules. The code and pretrained models are publicly available at https://github.com/FireRedTeam/InstanceAssemble.
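
To make the idea of bounding-box position control concrete, the following minimal sketch shows one way a layout could be represented and mapped onto a grid of image tokens. It is an illustration under stated assumptions, not the authors' implementation: the `Instance` schema, the normalized `(x0, y0, x1, y1)` box convention, the 4x4 grid size, and the `bbox_to_token_mask` helper are all hypothetical names introduced here.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Instance:
    """One layout entry (hypothetical schema, not taken from the paper)."""
    bbox: Tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized to [0, 1]
    text: str                                # per-instance textual description


def bbox_to_token_mask(bbox: Tuple[float, float, float, float],
                       grid_h: int, grid_w: int) -> List[List[int]]:
    """Mark the token cells whose centers fall inside the bounding box.

    Returns a grid_h x grid_w nested list of 0/1 values.
    """
    x0, y0, x1, y1 = bbox
    mask = []
    for row in range(grid_h):
        cy = (row + 0.5) / grid_h  # normalized vertical center of this row
        mask.append([
            1 if (x0 <= (col + 0.5) / grid_w <= x1 and y0 <= cy <= y1) else 0
            for col in range(grid_w)
        ])
    return mask


# Two example instances on an assumed 4x4 token grid.
layout = [
    Instance((0.0, 0.0, 0.5, 0.5), "a red apple"),
    Instance((0.5, 0.25, 1.0, 1.0), "a wooden chair"),
]
masks = [bbox_to_token_mask(inst.bbox, 4, 4) for inst in layout]
```

A mask of this kind could, for example, select which image tokens an instance is associated with; how InstanceAssemble actually uses the boxes inside its attention is defined in the paper and code release, not here.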

Paper Structure

This paper contains 31 sections, 9 equations, 18 figures, 8 tables.

Figures (18)

  • Figure 1: Layout-aware image generation results from InstanceAssemble. We show image generation results under precise layout control, ranging from simple to intricate and from sparse to dense layouts.
  • Figure 2: The proposed InstanceAssemble pipeline. Various layout conditions are processed by the Layout Encoder to obtain instance tokens, which guide the image generation via Assemble-MMDiT. In Assemble-MMDiT, the instance tokens interact with image tokens through the Assembling-Attn.
  • Figure 3: (Top) instance-image attention map w/ layout. (Middle) global prompt-image attention map w/ layout. (Bottom) global prompt-image attention map w/o layout.
  • Figure 4: Failure cases of other metrics. (a) False acceptance in CropVQA, (b) false rejection in CropVQA, (c) localization error in SAMIoU, and (d) discontinuity in BinaryIoU.
  • Figure 5: Qualitative comparison of InstanceAssemble with other methods.
  • ...and 13 more figures
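
The IoU-style metrics named in Figure 4 build on intersection-over-union between a target region and a detected region. As a point of reference, the sketch below computes plain box IoU for two axis-aligned boxes. It is the standard textbook quantity only; it is not the paper's SAMIoU, BinaryIoU, or LGS, and the `box_iou` name and `(x0, y0, x1, y1)` convention are assumptions made for this example.

```python
def box_iou(a, b):
    """Intersection-over-union of two axis-aligned boxes given as (x0, y0, x1, y1)."""
    ax0, ay0, ax1, ay1 = a
    bx0, by0, bx1, by1 = b

    # Width and height of the overlap rectangle, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax1, bx1) - max(ax0, bx0))
    inter_h = max(0.0, min(ay1, by1) - max(ay0, by0))
    inter = inter_w * inter_h

    area_a = (ax1 - ax0) * (ay1 - ay0)
    area_b = (bx1 - bx0) * (by1 - by0)
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) inputs.
    return inter / union if union > 0 else 0.0


# A target box and a box shifted by half its side length.
target = (0.0, 0.0, 2.0, 2.0)
detected = (1.0, 1.0, 3.0, 3.0)
score = box_iou(target, detected)  # overlap area 1, union area 7
```

A single hard IoU threshold on such a score turns localization into a pass/fail decision, which is one reason a metric can accept or reject an instance in ways that do not match visual inspection.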