LayoutDETR: Detection Transformer Is a Good Multimodal Layout Designer

Ning Yu; Chia-Chih Chen; Zeyuan Chen; Rui Meng; Gang Wu; Paul Josel; Juan Carlos Niebles; Caiming Xiong; Ran Xu

TL;DR

LayoutDETR addresses multimodal graphic layout design by reframing layout generation as a detection problem: a DETR-based architecture fuses background context with multimodal foreground inputs. It supports GAN-, VAE-, and VAE-GAN-based generator variants, trained with a combination of adversarial, variational, and layout-specific losses, including $L_{gIoU}$, $L_{overlap}$, and $L_{misalign}$, to produce realistic and regular layouts. The paper introduces a new large-scale ad banner dataset with rich semantic annotations and reports state-of-the-art realism, accuracy, and regularity on ad banners and related multimodal benchmarks. In user studies run through a graphical design system, users prefer LayoutDETR's designs over baselines by significant margins. Code, models, and the dataset are released.
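As a rough illustration of the $L_{gIoU}$ term named above, the sketch below computes the standard generalized IoU between two axis-aligned boxes and turns it into a loss as $1 - \mathrm{gIoU}$. The `(x1, y1, x2, y2)` box format, the function names `giou` and `giou_loss`, and the pure-Python form are assumptions made for illustration; the summary does not say how the paper parameterizes boxes or weights this term, and $L_{overlap}$ and $L_{misalign}$ are not sketched because their definitions are not given here.

```python
def giou(box_a, box_b):
    """Generalized IoU of two axis-aligned boxes.

    Boxes are (x1, y1, x2, y2) with x1 < x2 and y1 < y2.
    This box format is an assumption for illustration only.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)

    # Intersection: clamp width and height at zero for disjoint boxes.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = area_a + area_b - inter

    # Smallest axis-aligned box that encloses both inputs.
    enclose_w = max(ax2, bx2) - min(ax1, bx1)
    enclose_h = max(ay2, by2) - min(ay1, by1)
    enclose = enclose_w * enclose_h

    iou = inter / union
    # Penalize the part of the enclosing box not covered by the union.
    return iou - (enclose - union) / enclose


def giou_loss(box_pred, box_target):
    """Loss in [0, 2]: 0 for identical boxes, larger for distant boxes."""
    return 1.0 - giou(box_pred, box_target)
```

Unlike plain IoU, this quantity still varies when the boxes do not overlap, which is why it is commonly used as a box regression loss in detection models.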

Abstract

Graphic layout designs play an essential role in visual communication. Yet handcrafting layout designs is skill-demanding, time-consuming, and non-scalable to batch production. Generative models have emerged to make design automation scalable, but it remains non-trivial to produce designs that comply with designers' multimodal desires, i.e., constrained by background images and driven by foreground content. We propose LayoutDETR, which inherits the high quality and realism of generative modeling while reformulating content-aware requirements as a detection problem: we learn to detect in a background image the reasonable locations, scales, and spatial relations for multimodal foreground elements in a layout. Our solution sets a new state of the art for layout generation on public benchmarks and on our newly curated ad banner dataset. We integrate our solution into a graphical system that facilitates user studies, and show that users prefer our designs over baselines by significant margins. Code, models, dataset, and demos are available at https://github.com/salesforce/LayoutDETR.

Paper Structure (19 sections, 12 equations, 10 figures, 4 tables)

Introduction
Related Work
LayoutDETR
Generative Learning Frameworks
Additional Objectives
DETR-based Multimodal Architectures
Motivations of Using Each Network Component
New Ad Banner Dataset
Experiments
Ablation Study
Comparisons to Baselines
Graphical System Design and User Study
Conclusion
Supplementary material
Implementation Details
...and 4 more sections

Figures (10)

Figure 1: Left: LayoutDETR takes a background image and a set of multimodal foreground elements (images/texts) as input, and outputs an aesthetically appealing layout. Right: we show a few banner samples with rendered texts using our auto-designed layouts. "C" is the composition and "R" the rendering process.
Figure 2: Our unified training framework covers three generator variants: GAN-, VAE-, and VAE-GAN-based. The layout generator network (darker color and bold) appears in all variants. Its DETR-based multimodal architecture is at the bottom left. During inference, only the generator is needed.
Figure 3: Radar plots for Table \ref{tab:eval} on our ad banner dataset (top), the CGL Chinese ad banner dataset (middle), and the CLAY mobile application UI dataset (bottom). Each plot corresponds to a row (method) in the table. Each corner in a plot corresponds to a column (metric) in the table. Values are normalized to the unit range; higher is better.
Figure 4: Left: comparisons on the testing set of our ad banner dataset. We apply the same rendering process to all methods: (1) Text font sizes and line breaks are adaptively determined to fit tightly into their inferred boxes. (2) Text font colors and button pad colors are adaptively set to either black or white, whichever contrasts more with the background. (3) Button text colors are then determined to contrast with the button pads. (4) Text font is set to Arial. (5) Boxes are enforced to horizontally center-align with each other. Middle: comparisons on the CGL Chinese ad banner dataset. Image patches that contain foreground text elements are resized and overlaid on the background following the generated layouts. Right: comparisons on the CLAY mobile application UI dataset.
Figure 5: Top: AMT interface with instructions on the left for users to annotate the bounding box and class of each existing copywriting text on each image. Bottom: one instructional example of the definitions of text bounding boxes and text classes.
...and 5 more figures
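
Rendering steps (2) and (3) in the Figure 4 caption pick black or white depending on which contrasts more with a reference color. The caption does not say how contrast is measured, so the sketch below assumes the common sRGB relative-luminance contrast ratio; the function names and the idea of passing a single average background color are hypothetical and are not taken from the paper.

```python
def _linearize(channel):
    """Convert one 8-bit sRGB channel to linear light."""
    c = channel / 255.0
    return c / 12.92 if c <= 0.04045 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    """Relative luminance of an (R, G, B) color with 0-255 channels."""
    r, g, b = (_linearize(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def pick_black_or_white(reference_rgb):
    """Return black or white, whichever contrasts more with the reference.

    Assumption: contrast is the ratio (L_light + 0.05) / (L_dark + 0.05).
    """
    lum = relative_luminance(reference_rgb)
    contrast_with_white = (1.0 + 0.05) / (lum + 0.05)
    contrast_with_black = (lum + 0.05) / (0.0 + 0.05)
    if contrast_with_white >= contrast_with_black:
        return (255, 255, 255)
    return (0, 0, 0)


# Hypothetical usage: a button pad chosen against a dark background,
# then button text chosen against the pad, mirroring steps (2) and (3).
background = (20, 30, 90)
pad_color = pick_black_or_white(background)
button_text_color = pick_black_or_white(pad_color)
```

In this usage, the pad contrasts with the background and the button text contrasts with the pad, so the text ends up the opposite of the pad color.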
