Automatic Text Box Placement for Supporting Typographic Design

Jun Muraoka; Daichi Haraguchi; Naoto Inoue; Wataru Shimoda; Kota Yamaguchi; Seiichi Uchida

TL;DR

This work addresses automatic text box placement within partially completed multimodal layouts by comparing a task-tailored Transformer, a fine-tuned small VLM (Phi3.5-vision), and a large pretrained VLM (Gemini), along with an extended Transformer that processes per-element images. On the Crello dataset, the Transformer-based model achieves the best IoU and BDE scores on both single- and multi-text layouts; Phi3.5-vision outperforms Gemini but remains behind the Transformer, and incorporating multiple image inputs further improves performance. The findings emphasize the value of task-specific architectures for capturing spatial relationships in layout design and suggest avenues for improvement, such as joint multi-box optimization, probabilistic placement representations, and ensemble strategies to handle challenging layouts. The comparison clarifies the relative strengths and limitations of specialized Transformers versus pretrained VLMs in real-world typographic layout design tasks.

Abstract

In layout design for advertisements and web pages, balancing visual appeal and communication efficiency is crucial. This study examines automated text box placement in incomplete layouts, comparing a standard Transformer-based method, a small vision-and-language model (VLM), Phi3.5-vision, a large pretrained VLM (Gemini), and an extended Transformer that processes multiple images. Evaluations on the Crello dataset show that the Transformer-based models generally outperform the VLM-based approaches, particularly when they incorporate richer appearance information. However, all methods struggle with very small text and densely populated layouts. These findings highlight the benefits of task-specific architectures and suggest avenues for further improvement in automated layout design.
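The results above are reported in terms of IoU (intersection over union) between predicted and ground-truth text boxes. As a minimal sketch of how such a score is commonly computed, the function below evaluates IoU for two axis-aligned boxes. The `(left, top, width, height)` box convention and the name `box_iou` are assumptions made for illustration, not details taken from the paper; BDE is omitted because its exact definition is not given in this section.

```python
from typing import Tuple

# Hypothetical box convention for this sketch: (left, top, width, height).
Box = Tuple[float, float, float, float]


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b

    # Overlap extent along each axis, clamped at zero when the boxes are disjoint.
    inter_w = max(0.0, min(ax + aw, bx + bw) - max(ax, bx))
    inter_h = max(0.0, min(ay + ah, by + bh) - max(ay, by))
    inter = inter_w * inter_h

    # Union area = sum of the two areas minus the shared overlap.
    union = aw * ah + bw * bh - inter
    return inter / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = (0.10, 0.20, 0.40, 0.10)
    ground_truth = (0.15, 0.22, 0.40, 0.10)
    print(f"IoU = {box_iou(predicted, ground_truth):.3f}")
```

A score of 1.0 means the predicted box coincides with the ground truth, and 0.0 means the two boxes do not overlap at all. This makes IoU a strict measure for very small text boxes, where a slight offset removes most of the overlap.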
