PosterLLaVa: Constructing a Unified Multi-modal Layout Generator with LLM

Tao Yang, Yingmin Luo, Zhongang Qi, Yang Wu, Ying Shan, Chang Wen Chen

TL;DR

PosterLLaVa introduces a unified multi-modal layout generator that encodes design layouts as tokenized JSON and processes them with a vision-language model to satisfy complex visual and textual constraints. The approach unifies diverse layout tasks under a single framework, supported by visual instruction tuning, and is validated on public benchmarks plus two new datasets. It also delivers PosterGen, a text-to-poster pipeline that produces editable, multilingual posters, bridging layout generation with real-world design workflows. Across benchmarks and new data, PosterLLaVa demonstrates state-of-the-art performance and practical viability for large-scale automated graphic design.

Abstract

Layout generation is the keystone of automated graphic design: it requires arranging the position and size of various multi-modal design elements in a way that is visually pleasing and satisfies the given constraints. Previous approaches are either inefficient for large-scale applications or lack flexibility for varying design requirements. Our research introduces a unified framework for automated graphic layout generation, leveraging a multi-modal large language model (MLLM) to accommodate diverse design tasks. Our data-driven method employs structured text (JSON format) and visual instruction tuning to generate layouts under specific visual and textual constraints, including user-defined natural language specifications. We conducted extensive experiments and achieved state-of-the-art (SOTA) performance on public multi-modal layout generation benchmarks, demonstrating the effectiveness of our method. Moreover, recognizing existing datasets' limitations in capturing the complexity of real-world graphic designs, we propose two new datasets for much more challenging tasks (user-constrained generation and complicated poster generation), further validating our model's utility in real-life settings. Marked by its superior accessibility and adaptability, this approach further automates large-scale graphic design tasks. Finally, we develop an automated text-to-poster system that generates editable SVG posters based on users' design intentions, bridging the gap between layout generation and real-world graphic design applications. This system integrates our proposed layout generation method as its core component, demonstrating its effectiveness in practical scenarios. The code and datasets are open-sourced at https://github.com/posterllava/PosterLLaVA.
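To make the "structured text (JSON format)" idea concrete, here is a minimal sketch of how a layout might be serialized into JSON for a language model and how a generated answer might be parsed back and checked. The schema (`canvas`, `elements`, `category`, `box` as `[x, y, width, height]` in pixels) and the canvas size are assumptions for illustration only; the abstract does not specify the actual format used by PosterLLaVa.

```python
import json

# Hypothetical canvas size; the paper's actual resolution is not given here.
CANVAS_W, CANVAS_H = 512, 768


def layout_to_json(elements):
    """Serialize elements (category + pixel box) into a compact JSON string.

    Coordinates are rounded to integers so the text stays short when tokenized.
    """
    records = [
        {"category": e["category"], "box": [int(round(v)) for v in e["box"]]}
        for e in elements
    ]
    return json.dumps({"canvas": [CANVAS_W, CANVAS_H], "elements": records})


def parse_layout(text):
    """Parse a JSON layout and keep only boxes that lie fully inside the canvas."""
    data = json.loads(text)
    w, h = data["canvas"]
    valid = []
    for e in data["elements"]:
        x, y, bw, bh = e["box"]
        inside = x >= 0 and y >= 0 and x + bw <= w and y + bh <= h
        if bw > 0 and bh > 0 and inside:
            valid.append(e)
    return valid


example = [
    {"category": "title", "box": [40.2, 60.0, 430.7, 90.0]},
    {"category": "logo", "box": [400, 700, 200, 50]},  # overflows the canvas
]
encoded = layout_to_json(example)
decoded = parse_layout(encoded)
print(encoded)
print(decoded)
```

Because the layout is plain text, the same string can serve as a training target during instruction tuning and as model output at inference time; the validation step is an assumed safeguard against malformed generations, not a step described in the abstract.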

Paper Structure

This paper contains 18 sections, 2 equations, 8 figures, and 8 tables.

Figures (8)

  • Figure 1: The overall framework of our proposed content-aware layout generation method. Adopting the multi-modal LLM LLaVA as the central processing unit, we embed information from the visual and textual domains to generate a reasonable and visually pleasing graphic layout. The result is encoded in JSON format and can be rendered into a real-world poster.
  • Figure 2: Qualitative results on the PosterLayout (top), Youtube (middle), and QB-Poster (bottom) datasets. PosterLLaVa achieves the highest overall generation quality on all three datasets.
  • Figure 3: Qualitative results on the User-constrained Poster dataset. The user requirement texts are shown on the left; requirements in bold were violated by at least one of the methods.
  • Figure 4: The text-to-poster generation system PosterGen, in which graphic design generation can be decomposed into a) intention analysis, b) text-to-image background generation, c) content-aware layout generation, and d) text attributes generation (font, color, etc.). As illustrated, PosterGen can correctly interpret the user's design intention to create high-quality background images and display key information in an attention-grabbing location.
  • Figure 5: Qualitative comparison on the DESIGNERINTENTION benchmark (COLE) with the recently proposed poster generation methods COLE and OpenCOLE, and with the text-to-image models SDXL, DALL-E 3, and FLUX (GPT-4-based prompt augmentation is adopted, as advised by COLE). Our method offers better editability than vanilla T2I schemes and outperforms competitors in background quality and text readability while requiring fewer training resources.
  • ...and 3 more figures
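
The captions for Figures 1 and 4 note that a JSON layout can be rendered into an editable poster. The sketch below shows one way such a rendering step might look, turning a JSON layout into SVG with Python's standard library. The schema (`canvas`, `elements`, `box`, optional `text`) and the function name `layout_to_svg` are hypothetical and are not taken from the PosterGen implementation.

```python
import json
import xml.etree.ElementTree as ET


def layout_to_svg(layout_json):
    """Render a JSON layout (assumed schema) as an SVG string.

    Elements with a "text" field become <text> nodes; all others become
    outlined <rect> placeholders. Each node stays individually editable.
    """
    layout = json.loads(layout_json)
    width, height = layout["canvas"]
    svg = ET.Element("svg", {
        "xmlns": "http://www.w3.org/2000/svg",
        "width": str(width),
        "height": str(height),
        "viewBox": f"0 0 {width} {height}",
    })
    for el in layout["elements"]:
        x, y, w, h = el["box"]
        if "text" in el:
            # Use the box height as font size and its bottom edge as baseline.
            node = ET.SubElement(svg, "text", {
                "x": str(x), "y": str(y + h), "font-size": str(h),
            })
            node.text = el["text"]
        else:
            ET.SubElement(svg, "rect", {
                "x": str(x), "y": str(y), "width": str(w), "height": str(h),
                "fill": "none", "stroke": "black",
            })
    return ET.tostring(svg, encoding="unicode")


demo_layout = json.dumps({
    "canvas": [512, 768],
    "elements": [
        {"category": "title", "box": [40, 60, 430, 48], "text": "Summer Sale"},
        {"category": "logo", "box": [400, 700, 80, 40]},
    ],
})
svg_text = layout_to_svg(demo_layout)
print(svg_text)
```

Keeping text as SVG `<text>` nodes rather than rasterized pixels is what makes the output editable after generation, which is the advantage over plain text-to-image output that the Figure 5 caption points to.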