Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering

Zeyu Liu, Weicong Liang, Zhanhao Liang, Chong Luo, Ji Li, Gao Huang, Yuhui Yuan

TL;DR

The paper tackles the challenge of accurate visual text rendering in diffusion models by proposing a glyph-aware, character-aligned text encoder. It introduces Glyph-ByT5, a glyph-aligned ByT5 fine-tuned on ~1M glyph-text pairs, and integrates it into SDXL as Glyph-SDXL via region-wise cross-attention. A box-level contrastive loss and a glyph-augmentation pipeline drive training, and design-to-scene alignment extends capability to scene-text rendering. On design-image benchmarks, it boosts text-rendering accuracy from less than $20\%$ to nearly $90\%$, and Glyph-SDXL-Scene further improves photorealistic scene text; the work highlights the value of customized text encoders for open-domain text rendering. This sets a path for scalable, specialized encoders in diffusion models.

Abstract

Visual text rendering poses a fundamental challenge for contemporary text-to-image generation models, with the core problem lying in text encoder deficiencies. To achieve accurate text rendering, we identify two crucial requirements for text encoders: character awareness and alignment with glyphs. Our solution involves crafting a series of customized text encoders, Glyph-ByT5, by fine-tuning the character-aware ByT5 encoder using a meticulously curated paired glyph-text dataset. We present an effective method for integrating Glyph-ByT5 with SDXL, resulting in the creation of the Glyph-SDXL model for design image generation. This significantly enhances text rendering accuracy, improving it from less than $20\%$ to nearly $90\%$ on our design image benchmark. Noteworthy is Glyph-SDXL's newfound ability for text paragraph rendering, achieving high spelling accuracy for tens to hundreds of characters with automated multi-line layouts. Finally, through fine-tuning Glyph-SDXL with a small set of high-quality, photorealistic images featuring visual text, we showcase a substantial improvement in scene text rendering capabilities in open-domain real images. These compelling outcomes aim to encourage further exploration in designing customized text encoders for diverse and challenging tasks.

Paper Structure

This paper contains 17 sections, 2 equations, 18 figures, 16 tables.

Figures (18)

  • Figure 1: Illustrating the paragraph rendering capabilities with automatic multi-line layout planning ($1^\text{st}$ row), text-rich design images ($2^\text{nd}$ row), and open-domain images with scene text ($3^\text{rd}$ row), generated with our approach.
  • Figure 2: Illustrating the scheme of glyph augmentation. (a) original glyph. (b) character replacement (Happy $\to$ Hdppy). (c) character repeat (Happy $\to$ Happpppy). (d) character drop (Happy $\to$ Hapy). (e) character add (Graduation $\to$ Gradumation). (f) word replacement (Graduation $\to$ Gauatikn). (g) word repeat (Happy $\to$ Happy Happy). (h) word drop (Happy Graduation Amber $\to$ Graduation).
  • Figure 3: Illustrating the example images with paragraph visual text in our Paragraph-Glyph-Text dataset. From left to right, # of words: 55, 64, 52, 46, 34, 35, 40, 43; # of characters: 443, 442, 416, 318, 247, 267, 282, 302.
  • Figure 4: Illustrating the glyph-alignment pre-training framework and the region-wise multi-head cross attention module.
  • Figure 5: Qualitative comparison results. We show the results generated with our Glyph-SDXL and DALL$\cdot$E3 in the first row and second row, respectively.
  • ...and 13 more figures
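
The seven perturbations listed in the Figure 2 caption can be illustrated with a small string-level sketch. This is not the authors' implementation: the function names, the lowercase-only replacement alphabet, the repeat count, and the fake-word length range are all assumptions made for illustration, and the step that turns the perturbed text into a glyph image is omitted entirely.

```python
import random
import string

# Assumption: replacement and inserted characters come from lowercase ASCII.
LETTERS = string.ascii_lowercase


def replace_char(word, rng):
    """Character replacement, e.g. Happy -> Hdppy."""
    i = rng.randrange(len(word))
    new = rng.choice([c for c in LETTERS if c != word[i]])
    return word[:i] + new + word[i + 1:]


def repeat_char(word, rng, max_extra=4):
    """Character repeat, e.g. Happy -> Happpppy (max_extra is an assumed cap)."""
    i = rng.randrange(len(word))
    extra = rng.randint(1, max_extra)
    return word[:i] + word[i] * (extra + 1) + word[i + 1:]


def drop_char(word, rng):
    """Character drop, e.g. Happy -> Hapy."""
    i = rng.randrange(len(word))
    return word[:i] + word[i + 1:]


def add_char(word, rng):
    """Character add, e.g. Graduation -> Gradumation."""
    i = rng.randint(0, len(word))  # the new character may also go at the end
    return word[:i] + rng.choice(LETTERS) + word[i:]


def replace_word(words, rng):
    """Word replacement: swap one word for a random letter string."""
    i = rng.randrange(len(words))
    length = rng.randint(3, 10)  # assumed length range for the fake word
    fake = "".join(rng.choice(LETTERS) for _ in range(length))
    return words[:i] + [fake] + words[i + 1:]


def repeat_word(words, rng):
    """Word repeat, e.g. [Happy] -> [Happy, Happy]."""
    i = rng.randrange(len(words))
    return words[:i + 1] + [words[i]] + words[i + 1:]


def drop_words(words, rng):
    """Word drop: keep a random non-empty proper subset, in original order."""
    if len(words) < 2:
        return list(words)
    k = rng.randint(1, len(words) - 1)
    keep = sorted(rng.sample(range(len(words)), k))
    return [words[j] for j in keep]


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so repeated runs give the same output
    print(replace_char("Happy", rng))
    print(repeat_char("Happy", rng))
    print(drop_char("Happy", rng))
    print(add_char("Graduation", rng))
    print(" ".join(replace_word(["Happy", "Graduation"], rng)))
    print(" ".join(repeat_word(["Happy"], rng)))
    print(" ".join(drop_words(["Happy", "Graduation", "Amber"], rng)))
```

Each helper takes an explicit `random.Random` instance, so a fixed seed reproduces the same perturbations. Strings perturbed in this way could serve as near-miss negatives for the original text, which is one plausible reading of how the augmentation supports contrastive training.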