TextDiffuser: Diffusion Models as Text Painters

Jingye Chen, Yupan Huang, Tengchao Lv, Lei Cui, Qifeng Chen, Furu Wei

TL;DR

TextDiffuser addresses the challenge of rendering accurate, coherent text within diffusion-generated images by introducing a two-stage approach: a Layout Transformer that predicts keyword layouts and character-level segmentation masks, and a latent-diffusion generator conditioned on prompts and the generated layout. It also contributes MARIO-10M, a large OCR-annotated dataset, and MARIO-Eval, a comprehensive benchmark for text rendering quality. Through extensive experiments and user studies, the method achieves superior text readability and coherence with backgrounds, and supports text inpainting and template-based editing. The work advances practical capabilities for text rendering in design-centric imagery and suggests future improvements through higher-resolution backbones and OCR priors.

Abstract

Diffusion models have gained increasing attention for their impressive generation abilities but currently struggle with rendering accurate and coherent text. To address this issue, we introduce TextDiffuser, focusing on generating images with visually appealing text that is coherent with backgrounds. TextDiffuser consists of two stages: first, a Transformer model generates the layout of keywords extracted from text prompts, and then diffusion models generate images conditioned on the text prompt and the generated layout. Additionally, we contribute the first large-scale text images dataset with OCR annotations, MARIO-10M, containing 10 million image-text pairs with text recognition, detection, and character-level segmentation annotations. We further collect the MARIO-Eval benchmark to serve as a comprehensive tool for evaluating text rendering quality. Through experiments and user studies, we show that TextDiffuser is flexible and controllable to create high-quality text images using text prompts alone or together with text template images, and conduct text inpainting to reconstruct incomplete images with text. The code, model, and dataset will be available at https://aka.ms/textdiffuser.
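To make the first stage more concrete, the following is a minimal sketch of how a keyword layout could be turned into a character-level segmentation mask. It is not the authors' implementation. Several things are assumptions made for illustration: keywords are taken to be the double-quoted spans of the prompt, the boxes are supplied by hand as a stand-in for the Layout Transformer's output, each box is split evenly among its characters instead of rendering a font, and the label alphabet (`ALPHABET`) and the helper names are hypothetical.

```python
import re
import string
from typing import List, Tuple

# (left, top, right, bottom) in pixels; right and bottom are exclusive.
Box = Tuple[int, int, int, int]

# Hypothetical label alphabet: 0 is background, 1..36 are a-z and 0-9,
# and 37 is a catch-all for any other character.
ALPHABET = string.ascii_lowercase + string.digits


def extract_keywords(prompt: str) -> List[str]:
    """Collect the words inside double-quoted spans (assumed convention)."""
    words: List[str] = []
    for span in re.findall(r'"([^"]+)"', prompt):
        words.extend(span.split())
    return words


def char_to_label(ch: str) -> int:
    """Map a character to its integer class label (case-insensitive)."""
    idx = ALPHABET.find(ch.lower())
    return idx + 1 if idx >= 0 else len(ALPHABET) + 1


def rasterize_layout(keywords: List[str], boxes: List[Box],
                     height: int, width: int) -> List[List[int]]:
    """Fill each keyword box with per-character labels, left to right."""
    mask = [[0] * width for _ in range(height)]
    for word, (left, top, right, bottom) in zip(keywords, boxes):
        n = len(word)
        for i, ch in enumerate(word):
            # Evenly split the box width among the characters of the word.
            x0 = left + (right - left) * i // n
            x1 = left + (right - left) * (i + 1) // n
            label = char_to_label(ch)
            # Clip to the canvas so out-of-range boxes cannot raise.
            for y in range(max(top, 0), min(bottom, height)):
                for x in range(max(x0, 0), min(x1, width)):
                    mask[y][x] = label
    return mask


prompt = 'A poster with the words "Hi Bob"'
keywords = extract_keywords(prompt)
# Hand-written boxes standing in for the predicted layout.
boxes = [(0, 0, 4, 2), (0, 2, 6, 4)]
mask = rasterize_layout(keywords, boxes, height=4, width=6)
```

In this toy setup the second stage would receive `mask` as a conditioning signal alongside the prompt; a real system would work at image resolution and use rendered glyph shapes rather than rectangles.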

Paper Structure (51 sections, 5 equations, 27 figures, 8 tables)

Figures (27)

  • Figure 1: TextDiffuser generates accurate and coherent text images from text prompts alone or together with template images, and performs text inpainting to reconstruct incomplete images.
  • Figure 2: TextDiffuser consists of two stages. In the first Layout Generation stage, a Transformer-based encoder-decoder model generates character-level segmentation masks that indicate the layout of keywords in images from text prompts. In the second Image Generation stage, a diffusion model generates images conditioned on noisy features, segmentation masks, feature masks, and masked features (from left to right) along with text prompts. The feature masks can cover the entire or part of the image, corresponding to whole-image and part-image generation. The diffusion model learns to denoise features progressively with a denoising and character-aware loss. Please note that the diffusion model operates in the latent space, but we use the image pixels for better visualization.
  • Figure 3: Illustrations of three subsets of MARIO-10M. See more details in Appendix C.
  • Figure 4: Visualizations of whole-image generation compared with existing methods. The first three cases are generated from prompts and the last three cases are from given printed template images.
  • Figure 5: Comparison with the Character-Aware Model (Liu et al., 2022) and the concurrent GlyphDraw (Ma et al., 2023).
  • ...and 22 more figures
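
The Figure 2 caption lists four inputs to the diffusion model (noisy features, segmentation masks, feature masks, and masked features) and notes that the feature mask can cover the whole image or only part of it. The sketch below illustrates one way these inputs could be assembled. It is an illustration under stated assumptions, not the paper's code: the convention that a mask value of 1 marks the region to be generated, the masking rule `feature * (1 - mask)`, the single-channel segmentation mask, and all function names are hypothetical.

```python
from typing import List, Optional, Tuple

Grid = List[List[float]]          # one H x W channel
Box = Tuple[int, int, int, int]   # (left, top, right, bottom), exclusive ends


def make_feature_mask(height: int, width: int,
                      region: Optional[Box] = None) -> Grid:
    """Return 1.0 where content is to be generated (assumed convention).

    region=None covers the whole image (whole-image generation);
    a box restricts generation to that area (part-image generation).
    """
    if region is None:
        return [[1.0] * width for _ in range(height)]
    left, top, right, bottom = region
    return [[1.0 if (top <= y < bottom and left <= x < right) else 0.0
             for x in range(width)]
            for y in range(height)]


def mask_features(features: List[Grid], mask: Grid) -> List[Grid]:
    """Zero out every channel inside the masked region, keep the rest."""
    return [[[v * (1.0 - m) for v, m in zip(row, mask_row)]
             for row, mask_row in zip(channel, mask)]
            for channel in features]


def build_denoiser_input(noisy: List[Grid], seg_mask: Grid,
                         feature_mask: Grid,
                         masked: List[Grid]) -> List[Grid]:
    """Stack channels in the order given in the caption (left to right)."""
    return list(noisy) + [seg_mask] + [feature_mask] + list(masked)


# Toy 1-channel, 2 x 2 "latent" feature.
features = [[[1.0, 2.0], [3.0, 4.0]]]

# Whole-image generation: nothing from the original feature is kept.
whole_mask = make_feature_mask(2, 2)
whole_masked = mask_features(features, whole_mask)

# Part-image generation: only the left column is regenerated.
part_mask = make_feature_mask(2, 2, region=(0, 0, 1, 2))
part_masked = mask_features(features, part_mask)

seg_mask = [[0.0, 0.0], [0.0, 0.0]]   # placeholder segmentation mask
noisy = [[[0.5, 0.5], [0.5, 0.5]]]    # placeholder noisy feature
stacked = build_denoiser_input(noisy, seg_mask, part_mask, part_masked)
```

Under these assumptions, whole-image generation is simply the special case in which the masked feature carries no information, so the same network input layout can serve both generation and inpainting.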