ARTIST: Improving the Generation of Text-rich Images with Disentangled Diffusion Models and Large Language Models

Jianyi Zhang, Yufan Zhou, Jiuxiang Gu, Curtis Wigington, Tong Yu, Yiran Chen, Tong Sun, Ruiyi Zhang

TL;DR

ARTIST addresses the challenge of rendering legible, well-integrated text in diffusion-generated images by introducing a disentangled, two-stage framework: a textual diffusion module that learns text structure and a visual diffusion module that learns appearance, guided by an LLM-based prompt understanding component. The method trains the text module on large synthetic word- and sentence-level datasets and then trains the visual module with injected features from the text module, using diffusion losses. Empirical results on MARIO-Eval and ARTIST-Eval demonstrate substantial gains in OCR accuracy, CLIP alignment, and layout fidelity, with additional improvements when leveraging LLM prompting. This approach automates keyword extraction and layout planning, enabling more accurate text rendering in posters, book covers, and other text-rich imagery, and offers a scalable path for future improvements in text-aware image synthesis.

Abstract

Diffusion models have demonstrated exceptional capabilities in generating a broad spectrum of visual content, yet their proficiency in rendering text is still limited: they often generate inaccurate characters or words that fail to blend well with the underlying image. To address these shortcomings, we introduce a novel framework, ARTIST, which incorporates a dedicated textual diffusion model that focuses specifically on learning text structures. First, we pretrain this textual model to capture the intricacies of text representation. Then we finetune a visual diffusion model, enabling it to assimilate textual structure information from the pretrained textual model. This disentangled architecture design and training strategy significantly enhance the text rendering ability of diffusion models for text-rich image generation. Additionally, we leverage the capabilities of pretrained large language models to better interpret user intentions, contributing to improved generation quality. Empirical results on the MARIO-Eval benchmark underscore the effectiveness of the proposed method, showing an improvement of up to 15% in various metrics.
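
Both training stages summarized above are described as using diffusion losses. The paper's exact loss equations are not reproduced in this section, so the following is only a minimal sketch that assumes the standard noise-prediction objective commonly used for diffusion models. The names `add_noise`, `diffusion_loss`, and `alpha_bar_t` are illustrative and do not come from the paper, and plain Python lists stand in for image latents.

```python
import math
import random


def add_noise(x0, noise, alpha_bar_t):
    """Forward diffusion step (assumed standard form):
    x_t = sqrt(alpha_bar_t) * x0 + sqrt(1 - alpha_bar_t) * noise."""
    a = math.sqrt(alpha_bar_t)
    b = math.sqrt(1.0 - alpha_bar_t)
    return [a * x + b * n for x, n in zip(x0, noise)]


def diffusion_loss(predicted_noise, true_noise):
    """Mean squared error between the predicted and the true noise."""
    squared = [(p - n) ** 2 for p, n in zip(predicted_noise, true_noise)]
    return sum(squared) / len(squared)


random.seed(0)
x0 = [random.gauss(0.0, 1.0) for _ in range(16)]     # hypothetical clean latent
noise = [random.gauss(0.0, 1.0) for _ in range(16)]  # Gaussian noise sample
x_t = add_noise(x0, noise, alpha_bar_t=0.5)          # noised latent fed to a denoiser

# A denoiser that recovers the noise exactly incurs zero loss.
perfect_loss = diffusion_loss(noise, noise)
# A trivial denoiser that always predicts zeros incurs a positive loss.
zero_loss = diffusion_loss([0.0] * len(noise), noise)
```

In a two-stage setup like the one described, a loss of this form would presumably be minimized first for the textual module and then for the visual module. How features from the text module are injected is not specified in this section and is not modeled here.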

Paper Structure

This paper contains 27 sections, 2 equations, 11 figures, 5 tables.

Figures (11)

  • Figure 1: Generated examples from our ARTIST. Our framework adeptly identifies the text intended to be generated in the image from the given prompts, regardless of explicit marking by quotes. The generated text is legible and complements the visual elements, enhancing the overall coherence of the design.
  • Figure 2: Illustration of the proposed ARTIST framework. A large language model (LLM) is used to analyze the user's intention. Two diffusion models are trained to learn text structure and the remaining visual appearance, respectively. Given a user input, the LLM outputs keywords, a layout, and text prompts, which are fed into our trainable modules to generate target images.
  • Figure 3: Generated examples from our text module, along with input masks.
  • Figure 4: Comparison with TextDiffuser on MARIO-Eval benchmark. Layout generated by TextDiffuser is used as input conditions for both models for fair comparison.
  • Figure 5: Generated examples in the inpainting task, where the masked regions are indicated by red rectangles. The prompts used are "a book cover of Bed Times", "a book cover for Kansas State", "a poster for A Good Princess", and "a poster for The Man Who Said YES".
  • ...and 6 more figures