Retrieval-Augmented Layout Transformer for Content-Aware Layout Generation

Daichi Horita, Naoto Inoue, Kotaro Kikuchi, Kota Yamaguchi, Kiyoharu Aizawa

TL;DR

This work tackles content-aware layout generation under data scarcity by introducing the Retrieval-Augmented Layout Transformer (RALF). RALF retrieves nearest neighbor layouts, encodes them, and fuses them with the input canvas via cross-attention, which lets an autoregressive layout generator produce higher-quality, controllable layouts from less training data. The approach shows strong improvements on the PKU and CGL poster datasets in both unconstrained and constrained settings. It also generalizes across domains, and retrieval augmentation benefits other generative models as well. Overall, retrieval augmentation mitigates data limitations in structured graphic design tasks. The method's modular design and cross-modal retrieval offer practical benefits for real-world design pipelines, with potential extensions to poster generation beyond bounding boxes.
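
The fusion step mentioned above can be illustrated with a minimal single-head cross-attention sketch in NumPy. This is not the authors' implementation: the function name, the feature shapes, and the random projection matrices are assumptions made only for illustration. Canvas tokens act as queries and the encoded retrieved layouts act as keys and values.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row maximum for numerical stability.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def cross_attention(canvas_feats, retrieved_feats, w_q, w_k, w_v):
    """Single-head cross-attention (hypothetical helper).

    canvas_feats:    (n_canvas, d) features of the input canvas
    retrieved_feats: (n_retrieved, d) features of the retrieved layouts
    """
    q = canvas_feats @ w_q      # queries come from the canvas
    k = retrieved_feats @ w_k   # keys come from the retrieved layouts
    v = retrieved_feats @ w_v   # values come from the retrieved layouts
    scores = q @ k.T / np.sqrt(q.shape[-1])
    weights = softmax(scores, axis=-1)  # each canvas token attends over all retrieved tokens
    return weights @ v, weights

# Toy inputs; the sizes and random weights are assumptions, not values from the paper.
rng = np.random.default_rng(0)
d = 8
canvas_feats = rng.normal(size=(4, d))
retrieved_feats = rng.normal(size=(6, d))
w_q, w_k, w_v = (rng.normal(size=(d, d)) for _ in range(3))

fused, weights = cross_attention(canvas_feats, retrieved_feats, w_q, w_k, w_v)
print(fused.shape, weights.shape)
```

Each row of `weights` sums to one, so every fused canvas token is a convex combination of the projected retrieved-layout features.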

Abstract

Content-aware graphic layout generation aims to automatically arrange visual elements to fit a given piece of content, such as an e-commerce product image. In this paper, we argue that current layout generation approaches suffer from limited training data relative to the high-dimensional layout structure. We show that a simple retrieval augmentation can significantly improve generation quality. Our model, named Retrieval-Augmented Layout Transformer (RALF), retrieves nearest neighbor layout examples based on an input image and feeds these results into an autoregressive generator. Our model can apply retrieval augmentation to various controllable generation tasks and yields high-quality layouts within a unified architecture. Our extensive experiments show that RALF successfully generates content-aware layouts in both constrained and unconstrained settings and significantly outperforms the baselines.
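
As a rough illustration of the retrieval step, the sketch below looks up the K nearest neighbor layouts for a query image by cosine similarity over precomputed image features. The similarity measure, the feature vectors, and the names `retrieve_top_k`, `db_feats`, and `db_layouts` are assumptions for this example. The abstract only states that retrieval is based on the input image.

```python
import numpy as np

def retrieve_top_k(query_feat, db_feats, k):
    """Return indices and similarities of the k most similar database entries.

    Cosine similarity is an assumed choice here, not confirmed by the source.
    """
    q = query_feat / np.linalg.norm(query_feat)
    db = db_feats / np.linalg.norm(db_feats, axis=1, keepdims=True)
    sims = db @ q                    # cosine similarity to every database image
    top = np.argsort(-sims)[:k]      # indices sorted by descending similarity
    return top, sims[top]

# Toy database: one image feature per training example, paired with its layout.
db_feats = np.array([
    [1.0, 0.0],
    [0.0, 1.0],
    [0.9, 0.1],
    [-1.0, 0.0],
])
db_layouts = ["layout_a", "layout_b", "layout_c", "layout_d"]  # placeholder layouts

query = np.array([1.0, 0.05])
idx, sims = retrieve_top_k(query, db_feats, k=2)
neighbors = [db_layouts[i] for i in idx]
print(neighbors)  # ['layout_a', 'layout_c']
```

The retrieved layouts would then be encoded and passed to the generator as reference examples.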

Paper Structure (55 sections, 3 equations, 15 figures, 13 tables)

Figures (15)

  • Figure 1: Retrieval-augmented content-aware layout generation. We retrieve nearest neighbor examples based on the input image and use them as a reference to augment the generation process.
  • Figure 2: Overview of Retrieval-Augmented Layout Transformer (RALF). RALF takes a canvas image and a saliency map as input, and then autoregressively generates a layout along with the input image. Our model uses (a) retrieval augmentation that incorporates useful examples to better capture the relationship between the image and the layout, and (b) constraint serialization, an optional module that encodes user-specified requirements, enabling the generation of layouts that adhere to specific requirements for controllable generation.
  • Figure 3: Visual comparison of unconstrained generation with baselines. Input canvases are selected from the unannotated split.
  • Figure 4: FID over the training dataset size (#TrainingDataset), which has up to 7,734 samples.
  • Figure 5: FID over the retrieval size $K$ (#Retrieval).
  • ...and 10 more figures