Sketch-to-Layout: Sketch-Guided Multimodal Layout Generation

Riccardo Brioschi, Aleksandr Alekseev, Emanuele Nevali, Berkay Döner, Omar El Malki, Blagoj Mitrevski, Leandro Kieliger, Mark Collier, Andrii Maksai, Jesse Berent, Claudiu Musat, Efi Kokiopoulou

TL;DR

This paper addresses the challenge of generating graphic layouts guided by intuitive user input. It proposes a multimodal transformer framework that conditions on a sketch together with image and text assets, and a synthetic sketch-generation pipeline that makes training scalable. The approach, implemented with PaliGemma 3B and a structured protocol buffer representation of the layout, outperforms state-of-the-art constraint-based methods by substantial margins on PubLayNet, DocLayNet, and SlidesVQA, and performs robustly with both synthetic and real sketches. The work also introduces a content-awareness metric and demonstrates the value of asset content in guiding layout generation, offering a scalable pathway for future sketch-to-layout research and practical design tools.

Abstract

Graphic layout generation is a growing research area focusing on generating aesthetically pleasing layouts ranging from poster designs to documents. While recent research has explored ways to incorporate user constraints to guide layout generation, these constraints often require complex specifications, which reduces usability. We introduce an innovative approach that uses user-provided sketches as intuitive constraints, and we demonstrate empirically the effectiveness of this new guidance method, establishing the sketch-to-layout problem as a promising and currently under-explored research direction. To tackle the sketch-to-layout problem, we propose a multimodal transformer-based solution that takes the sketch and the content assets as inputs and produces high-quality layouts. Since collecting sketch training data from human annotators is very costly, we introduce a novel and efficient method to synthetically generate training sketches at scale. We train and evaluate our model on three publicly available datasets: PubLayNet, DocLayNet and SlidesVQA, demonstrating that it outperforms state-of-the-art constraint-based methods while offering a more intuitive design experience. To facilitate future sketch-to-layout research, we release O(200k) synthetically generated sketches for the public datasets above. The datasets are available at https://github.com/google-deepmind/sketch_to_layout.

Paper Structure

This paper contains 36 sections, 1 equation, 18 figures, 5 tables.

Figures (18)

  • Figure 1: Our sketch-to-layout approach leverages sketches to guide the generation of multimodal layouts in a natural and intuitive way.
  • Figure 2: Time-performance trade-off between guidance methods on the PubLayNet dataset.
  • Figure 3: Our method: a sketch, together with image and text assets, is given to a VLM, which generates a structured representation of the layout that can be rendered as an image.
  • Figure 4: Synthetic Sketch Generation Pipeline. Every asset is matched with a stroke primitive based on its attributes and strokes are rescaled and combined to generate the synthetic sketch.
  • Figure 5: Different coverage rates.
  • ...and 13 more figures
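
The Figure 4 caption describes the synthetic sketch pipeline in three steps: match each asset to a stroke primitive based on its attributes, rescale the strokes, and combine them into one sketch. The following is a minimal illustration of that idea only, not the authors' implementation. The `Asset` class, the `PRIMITIVES` table, the two asset kinds, and the stroke shapes are all hypothetical choices made for this example; the paper's actual primitives and matching rules are not given in this summary.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float]
Stroke = List[Point]  # a polyline; primitives are defined in the unit square


@dataclass
class Asset:
    """Hypothetical asset record: a kind plus its target bounding box."""
    kind: str                                # assumed values: "image" or "text"
    box: Tuple[float, float, float, float]   # (x, y, width, height) on the page


# Hypothetical stroke primitives, each drawn inside the unit square.
PRIMITIVES: Dict[str, List[Stroke]] = {
    # An image is drawn as a rectangle with two diagonals.
    "image": [
        [(0, 0), (1, 0), (1, 1), (0, 1), (0, 0)],
        [(0, 0), (1, 1)],
        [(1, 0), (0, 1)],
    ],
    # A text block is drawn as three horizontal lines, the last one shorter.
    "text": [
        [(0, 0.25), (1, 0.25)],
        [(0, 0.5), (1, 0.5)],
        [(0, 0.75), (0.6, 0.75)],
    ],
}


def match_primitive(asset: Asset) -> List[Stroke]:
    """Pick a primitive from the asset's attributes (here, only its kind)."""
    return PRIMITIVES[asset.kind]


def rescale(strokes: List[Stroke], box: Tuple[float, float, float, float]) -> List[Stroke]:
    """Map unit-square strokes into the asset's bounding box."""
    x, y, w, h = box
    return [[(x + px * w, y + py * h) for px, py in stroke] for stroke in strokes]


def synthesize_sketch(assets: List[Asset]) -> List[Stroke]:
    """Combine the rescaled strokes of every asset into one sketch."""
    sketch: List[Stroke] = []
    for asset in assets:
        sketch.extend(rescale(match_primitive(asset), asset.box))
    return sketch


if __name__ == "__main__":
    page = [Asset("image", (10, 20, 100, 50)), Asset("text", (10, 80, 100, 30))]
    for stroke in synthesize_sketch(page):
        print(stroke)
```

Representing strokes as polylines in the unit square keeps the rescaling step a single affine map per asset, so the same primitive can be reused for boxes of any size. A fuller version would likely add randomness to stroke shapes so the synthetic sketches look hand-drawn, which this example omits.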