Multimodal Markup Document Models for Graphic Design Completion

Kotaro Kikuchi, Ukyo Honda, Naoto Inoue, Mayu Otani, Edgar Simo-Serra, Kota Yamaguchi

TL;DR

MarkupDM reframes graphic design as interleaved multimodal documents, enabling unified generation of markup and image content with a custom RGBA image tokenizer. By training with a fill-in-the-middle objective, it can complete missing attributes, text, or images within SVG-like templates, and extend to instruction-guided design completion via the Crello-Instruct dataset. Empirical results show competitive performance across attribute-value, text, and image completion, and the instruction-tuned variant demonstrates favorable results against state-of-the-art image-editing models, especially in textual completion. This work suggests a versatile foundation for broad design automation by integrating structured multimodal representations with large language models.

Abstract

We introduce MarkupDM, a multimodal markup document model that represents graphic design as an interleaved multimodal document consisting of both markup language and images. Unlike existing holistic approaches that rely on an element-by-attribute grid representation, our representation accommodates variable-length elements, type-dependent attributes, and text content. Inspired by fill-in-the-middle training in code generation, we train the model to complete the missing part of a design document from its surrounding context, allowing it to treat various design tasks in a unified manner. Our model also supports image generation by predicting discrete image tokens through a specialized tokenizer with support for image transparency. We evaluate MarkupDM on three tasks: attribute value, image, and text completion. The results demonstrate that it can produce plausible designs consistent with the given context. To further illustrate the flexibility of our approach, we also evaluate it on a new instruction-guided design completion task, where our instruction-tuned MarkupDM compares favorably to state-of-the-art image editing models, especially in textual completion. These findings suggest that multimodal language models with our document representation can serve as a versatile foundation for broad design automation.
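
To make the fill-in-the-middle idea concrete, the following is a minimal sketch of how a training example could be built from an SVG-like design document. It is an illustration under stated assumptions, not the authors' implementation: the sentinel strings, the helper names `make_fim_example` and `mask_attribute`, and the sample document are all hypothetical, and the sketch works on characters rather than on the model's actual text and image tokens.

```python
# Hypothetical sentinel strings; the special tokens MarkupDM actually uses
# are not specified in this summary.
PRE, SUF, MID = "<fim_prefix>", "<fim_suffix>", "<fim_middle>"


def make_fim_example(document: str, start: int, end: int) -> tuple[str, str]:
    """Cut document[start:end] out and return (model_input, target)."""
    if not 0 <= start <= end <= len(document):
        raise ValueError("span is out of range")
    prefix, middle, suffix = document[:start], document[start:end], document[end:]
    # Prefix-suffix-middle ordering: a causal model reads both sides of the
    # gap first and then predicts the missing middle part.
    return f"{PRE}{prefix}{SUF}{suffix}{MID}", middle


def mask_attribute(document: str, name: str) -> tuple[str, str]:
    """Mask the value of the first attribute written as name="..."."""
    key = f'{name}="'
    start = document.index(key) + len(key)
    end = document.index('"', start)
    return make_fim_example(document, start, end)


# Illustrative document only; not taken from the paper's dataset.
doc = (
    '<svg width="800" height="600">'
    '<text x="40" y="80" font-size="32">SALE</text>'
    "</svg>"
)
model_input, target = mask_attribute(doc, "font-size")
print(model_input)
print(target)
```

The same split applies whether the masked span is an attribute value, a text element, or a run of image tokens, which is what lets a single objective cover the three completion tasks named in the abstract.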

Paper Structure (27 sections, 13 figures, 5 tables)

Figures (13)

  • Figure 1: Our MarkupDM is based on a causal multimodal LLM, with separate embedding layers and prediction heads dedicated to image and text tokens.
  • Figure 2: Our image tokenizer is trained by reconstructing images resized to a fixed size. When decoding, the image size is given in addition to the image tokens.
  • Figure 3: Examples of our Crello-Instruct dataset.
  • Figure 4: Image reconstruction results.
  • Figure 5: Text completion results. Each pair shows the predicted completion and the original design from left to right or top to bottom. The green boxes indicate the target text and some of them are zoomed in for better visibility.
  • ...and 8 more figures