LAMIC: Layout-Aware Multi-Image Composition via Scalability of Multimodal Diffusion Transformer

Yuzhuo Chen, Zehua Ma, Jianhua Wang, Kai Kang, Shunyu Yao, Weiming Zhang

TL;DR

LAMIC tackles layout-aware multi-image synthesis without additional training by extending single-reference diffusion models to handle multiple references in a unified framework. It builds on the Multimodal Diffusion Transformer (MMDiT) and introduces two plug-and-play attention mechanisms, Group Isolation Attention (GIA) and Region-Modulated Attention (RMA), to disentangle entities and control spatial layout. The method defines structured VTS triplets, encodes them into unified tokens, and performs Multi-VTS guided generation with a two-stage attention scheme, aiming at strong identity preservation and layout fidelity. The reported results show state-of-the-art performance on most major metrics, including identity, background, and layout scores, in zero-shot settings, establishing a scalable, training-free path for controllable multi-image composition.

Abstract

In controllable image synthesis, generating coherent and consistent images from multiple references with spatial layout awareness remains an open challenge. We present LAMIC, a Layout-Aware Multi-Image Composition framework that, for the first time, extends single-reference diffusion models to multi-reference scenarios in a training-free manner. Built upon the MMDiT model, LAMIC introduces two plug-and-play attention mechanisms: 1) Group Isolation Attention (GIA) to enhance entity disentanglement; and 2) Region-Modulated Attention (RMA) to enable layout-aware generation. To comprehensively evaluate model capabilities, we further introduce three metrics: Inclusion Ratio (IN-R) and Fill Ratio (FI-R) for assessing layout control, and Background Similarity (BG-S) for measuring background consistency. Extensive experiments show that LAMIC achieves state-of-the-art performance across most major metrics: it consistently outperforms existing multi-reference baselines in ID-S, BG-S, IN-R and AVG scores across all settings, and achieves the best DPG in complex composition tasks. These results demonstrate LAMIC's superior abilities in identity preservation, background preservation, layout control, and prompt following, all achieved without any training or fine-tuning, showcasing strong zero-shot generalization. By inheriting the strengths of advanced single-reference models and enabling seamless extension to multi-image scenarios, LAMIC establishes a new training-free paradigm for controllable multi-image composition. As foundation models continue to evolve, LAMIC's performance is expected to scale accordingly. Our implementation is available at: https://github.com/Suchenl/LAMIC.
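
The abstract names Group Isolation Attention but does not spell out its masking rule, so the following is only an illustrative sketch of the general idea of isolating token groups in attention, not the paper's definition. It assumes that every token carries an integer group label, that tokens from different reference groups may not attend to each other, and that a hypothetical shared group (label 0, for example the tokens of the image being generated) may interact with everything. The function name, the label convention, and the shared-group rule are all assumptions made for this example.

```python
from typing import List

# Hypothetical label for tokens that may interact with every group
# (an assumption for this sketch, not taken from the paper).
SHARED = 0


def group_isolation_mask(group_ids: List[int]) -> List[List[bool]]:
    """Return mask[i][j] = True when token i may attend to token j."""
    n = len(group_ids)
    mask = [[False] * n for _ in range(n)]
    for i, gi in enumerate(group_ids):
        for j, gj in enumerate(group_ids):
            # Allowed: same group, or either token belongs to the shared group.
            mask[i][j] = gi == gj or gi == SHARED or gj == SHARED
    return mask


if __name__ == "__main__":
    # Two shared tokens followed by two reference groups of two tokens each.
    ids = [0, 0, 1, 1, 2, 2]
    for row in group_isolation_mask(ids):
        print("".join("1" if allowed else "." for allowed in row))
```

In a real attention layer, a mask of this shape would typically be converted to a tensor and passed as the attention mask, so that disallowed pairs receive zero weight after the softmax. How LAMIC actually builds and schedules its masks across the two attention stages is described in the paper and the linked repository.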

Paper Structure

This paper contains 29 sections, 10 equations, 8 figures, 3 tables.

Figures (8)

  • Figure 1: An example of layout-aware multi-image composition generated by our proposed model, LAMIC.
  • Figure 2: Framework of our proposed LAMIC. We illustrate the layout-aware multi-image composition process with 5 reference groups (n=5) provided as input.
  • Figure 3: Our proposed attention mechanisms.
  • Figure 4: Visual comparison of different methods under different multi-reference images.
  • Figure 5: Visual comparison of different settings of LAMIC under layout-aware multi-image composition.
  • ...and 3 more figures