UniLayDiff: A Unified Diffusion Transformer for Content-Aware Layout Generation

Zeyang Liu, Le Wang, Sanping Zhou, Yuxuan Wu, Xiaolong Sun, Gang Hua, Haoxiang Li

TL;DR

This work tackles content-aware layout generation by introducing UniLayDiff, a unified diffusion-based framework that treats layout constraints as a separate modality within a Multi-Modal Diffusion Transformer. It enables end-to-end conditional generation across a wide spectrum of tasks, using unified inputs (image, saliency, bounding boxes, partial constraints, and relation constraints) and a dual-path attention mechanism to model cross- and intra-modal interactions. Relational constraints are incorporated via a structured relative bias and a dedicated $\mathcal{L}_{\mathrm{rel}}$ loss, with LoRA fine-tuning to inject relation-specific behavior without degrading the pretrained generative priors. Extensive experiments on PKU and CGL show state-of-the-art performance across unconditional and multiple conditioned tasks, along with strong generalization, ablations, and qualitative results that confirm coherent, visually appealing, and constraint-adherent layouts.

Abstract

Content-aware layout generation is a critical task in graphic design automation, focused on creating visually appealing arrangements of elements that seamlessly blend with a given background image. The variety of real-world applications makes it highly challenging to develop a single model capable of unifying the diverse range of input-constrained generation sub-tasks, such as those conditioned on element types, sizes, or their relationships. Current methods either address only a subset of these tasks or require separate model parameters for different conditions, failing to offer a truly unified solution. In this paper, we propose UniLayDiff, a Unified Diffusion Transformer that, for the first time, addresses various content-aware layout generation tasks with a single, end-to-end trainable model. Specifically, we treat layout constraints as a distinct modality and employ a Multi-Modal Diffusion Transformer framework to capture the complex interplay between the background image, layout elements, and diverse constraints. Moreover, we integrate relation constraints by fine-tuning the model with LoRA after pretraining it on the other tasks. Such a scheme not only achieves unified conditional generation but also enhances overall layout quality. Extensive experiments demonstrate that UniLayDiff achieves state-of-the-art performance on tasks ranging from unconditional to various conditional generation and, to the best of our knowledge, is the first model to unify the full range of content-aware layout generation tasks.
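
The abstract describes adding relation constraints by LoRA fine-tuning on top of a pretrained model. As a rough illustration of the general LoRA idea only (not the authors' implementation), the sketch below keeps a pretrained weight matrix frozen and adds a trainable low-rank update scaled by `alpha / rank`. The class name `LoRALinear`, the helper `matmul`, and all sizes are hypothetical, and which layers of UniLayDiff receive adapters is not stated here.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


class LoRALinear:
    """Hypothetical linear layer: frozen weight W plus (alpha / rank) * B @ A."""

    def __init__(self, weight, rank, alpha, seed=0):
        rng = random.Random(seed)
        d_out, d_in = len(weight), len(weight[0])
        self.weight = weight          # pretrained weight, never modified
        self.scale = alpha / rank     # standard LoRA scaling factor
        # A starts small and random, B starts at zero, so the update is zero
        # at initialization and the pretrained behavior is preserved.
        self.A = [[rng.gauss(0.0, 0.01) for _ in range(d_in)] for _ in range(rank)]
        self.B = [[0.0] * rank for _ in range(d_out)]

    def effective_weight(self):
        """Return W + scale * (B @ A) without touching the frozen W."""
        delta = matmul(self.B, self.A)  # shape: d_out x d_in
        return [[w + self.scale * d for w, d in zip(w_row, d_row)]
                for w_row, d_row in zip(self.weight, delta)]

    def forward(self, x):
        """Apply the layer to a batch of row vectors (batch x d_in)."""
        w_eff = self.effective_weight()
        w_eff_t = [list(col) for col in zip(*w_eff)]  # transpose to d_in x d_out
        return matmul(x, w_eff_t)


layer = LoRALinear([[1.0, 2.0], [3.0, 4.0]], rank=1, alpha=2.0)
before = layer.forward([[1.0, 1.0]])   # identical to the frozen layer's output
layer.B, layer.A = [[1.0], [0.0]], [[1.0, 0.0]]  # stand-in for a training step
after = layer.forward([[1.0, 1.0]])
```

Because `B` is initialized to zero, the adapted layer reproduces the pretrained output until the low-rank factors are trained, which is consistent with the stated goal of adding relation-specific behavior without degrading the pretrained generative priors.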

Paper Structure

This paper contains 16 sections, 13 equations, 8 figures, 15 tables.

Figures (8)

  • Figure 1: Previous models such as RALF and LayoutDiT either handle different conditional generation tasks with different parameters or support only a subset of the conditional tasks, whereas our proposed model, UniLayDiff, handles various content-aware conditional generation tasks in a single unified diffusion model.
  • Figure 2: Overview of UniLayDiff. (a) Our framework takes as input the image $\mathbf{I}$ with its saliency map $\mathbf{S}$, a noisy layout $\mathbf{x}_t$, a partial constraint mask $\mathbf{M}_P$ (sampled from various tasks), and a relation matrix $\mathbf{M}_R$. (b) Features $\mathbf{F}_I$, $\mathbf{F}_L$, and $\mathbf{F}_M$ interact via the dual-path MM-Attention module. (c) Relational constraints are integrated via a block-diagonal relative positional bias (where ${L}_i$ denotes the $i$-th layout element, ${M}_i$ the $i$-th mask element, and ${L}_0$ the canvas). This relation module and parts of the mask interaction branch are activated during the LoRA fine-tuning stage. Our model is first pre-trained on four fundamental tasks (a.2), then fine-tuned to support the relation task and to improve layout quality. The resulting fine-tuned model is used for inference on all tasks.
  • Figure 3: Visual comparison with baselines on the PKU annotated test split.
  • Figure 4: Visual ablation study on the relationship injection strategy on the PKU dataset.
  • Figure A: Visual comparison with baselines on the CGL annotated test split.
  • ...and 3 more figures
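
The Figure 2 caption describes relational constraints entering attention as a relative bias. The sketch below illustrates only the generic mechanism of adding a bias matrix to attention logits before the softmax. The relation vocabulary, the scalar values in `REL_BIAS`, and the function names are hypothetical, and the block-diagonal structure over layout and mask tokens mentioned in the caption is not modeled.

```python
import math

# Hypothetical relation labels and per-relation scalar biases (in a real model
# these would be learned parameters, not fixed constants).
REL_BIAS = {"none": 0.0, "above": 1.5, "larger": -0.5}


def relation_bias(relations):
    """Turn a matrix of relation labels into a matrix of additive biases."""
    return [[REL_BIAS[label] for label in row] for row in relations]


def softmax(row):
    """Numerically stable softmax over one row of logits."""
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def biased_attention_weights(q, k, bias):
    """Compute softmax(q k^T / sqrt(d) + bias) row by row."""
    d = len(q[0])
    weights = []
    for i, q_i in enumerate(q):
        logits = [
            sum(a * b for a, b in zip(q_i, k_j)) / math.sqrt(d) + bias[i][j]
            for j, k_j in enumerate(k)
        ]
        weights.append(softmax(logits))
    return weights


# Two elements with identical (zero) features, so only the bias matters here.
q = k = [[0.0, 0.0], [0.0, 0.0]]
relations = [["none", "above"], ["none", "none"]]
weights = biased_attention_weights(q, k, relation_bias(relations))
```

In this toy setup, element 0 attends more strongly to element 1 because of the positive bias on their relation, while element 1, which has no specified relation, attends uniformly.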