Automatic Text Box Placement for Supporting Typographic Design

Jun Muraoka, Daichi Haraguchi, Naoto Inoue, Wataru Shimoda, Kota Yamaguchi, Seiichi Uchida

TL;DR

This work addresses automatic text box placement within partially completed multimodal layouts by comparing a task-tailored Transformer, a fine-tuned small VLM (Phi3.5-vision), and a large pretrained VLM (Gemini), as well as an extended Transformer that processes per-element images. On the Crello dataset, the Transformer-based model achieves the best IoU and BDE scores on both single- and multi-text layouts, while Phi3.5-vision outperforms Gemini but remains behind the Transformer; incorporating multiple image inputs further improves performance. The findings emphasize the value of task-specific architectures for capturing spatial relationships in layout design and suggest avenues for improvement, such as joint multi-box optimization, probabilistic placement representations, and ensemble strategies to handle challenging layouts. This work advances automated typographic layout design by clarifying the relative strengths and limitations of specialized Transformers versus pretrained VLMs in real-world design tasks.

Abstract

In layout design for advertisements and web pages, balancing visual appeal and communication efficiency is crucial. This study examines automated text box placement in incomplete layouts, comparing a standard Transformer-based method, a small vision-language model (VLM), Phi3.5-vision, a large pretrained VLM (Gemini), and an extended Transformer that processes multiple images. Evaluations on the Crello dataset show that the Transformer-based models generally outperform the VLM-based approaches, particularly when incorporating richer appearance information. However, all methods face challenges with very small text or densely populated layouts. These findings highlight the benefits of task-specific architectures and suggest avenues for further improvement in automated layout design.

Paper Structure

This paper contains 25 sections, 7 figures, 1 table.

Figures (7)

  • Figure 1: (a) Overview of the text box placement task. (b)-(d) Three machine learning-based methods for solving the task.
  • Figure 2: (a) Details of $N$ input elements. (b) and (c) Models for the optimal text box placement. (d) The details of the JSON format for VLM input. (e) Another version of (b) for utilizing a bitmap-based representation of individual input elements.
  • Figure 3: Distributions of IoU and BDE over the entire test set (single-text and multiple-text layouts). For each method, we present IoU and BDE histograms, an IoU-BDE scatter plot, and a corresponding heatmap.
  • Figure 4: Effect of the target text area on placement accuracy.
  • Figure 6: Representative examples where all three methods placed the text box successfully.
  • ...and 2 more figures
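
The IoU metric named in the TL;DR and in Figure 3 measures how much a predicted text box overlaps the ground-truth box. The following is a minimal sketch of that computation, not the authors' evaluation code. The `Box` type, its `(left, top, width, height)` layout, and the function name `iou` are assumptions made for illustration; this excerpt does not specify the box format. BDE is not defined here, so it is left out.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Axis-aligned box; the (left, top, width, height) layout is an assumption."""
    left: float
    top: float
    width: float
    height: float


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    # Overlap along each axis, clamped to zero when the boxes do not intersect.
    overlap_w = max(0.0, min(a.left + a.width, b.left + b.width) - max(a.left, b.left))
    overlap_h = max(0.0, min(a.top + a.height, b.top + b.height) - max(a.top, b.top))
    intersection = overlap_w * overlap_h
    union = a.width * a.height + b.width * b.height - intersection
    # Two degenerate (zero-area) boxes have no meaningful overlap; return 0.
    return intersection / union if union > 0 else 0.0
```

A perfect placement gives an IoU of 1, and a prediction that misses the ground-truth box entirely gives 0. This helps explain why very small text boxes are hard to score well on: a small positional offset removes a large share of the overlap.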