Glyph-ByT5: A Customized Text Encoder for Accurate Visual Text Rendering
Zeyu Liu, Weicong Liang, Zhanhao Liang, Chong Luo, Ji Li, Gao Huang, Yuhui Yuan
TL;DR
The paper tackles the challenge of accurate visual text rendering in diffusion models by proposing a glyph-aware, character-aligned text encoder. It introduces Glyph-ByT5, a glyph-aligned ByT5 fine-tuned on ~1M glyph-text pairs, and integrates it into SDXL as Glyph-SDXL via region-wise cross-attention. A box-level contrastive loss and a glyph-augmentation pipeline drive training, and design-to-scene alignment extends capability to scene-text rendering. On design-image benchmarks, it boosts text-rendering accuracy from less than $20\%$ to nearly $90\%$, and Glyph-SDXL-Scene further improves photorealistic scene text; the work highlights the value of customized text encoders for open-domain text rendering. This sets a path for scalable, specialized encoders in diffusion models.
Abstract
Visual text rendering poses a fundamental challenge for contemporary text-to-image generation models, with the core problem lying in text encoder deficiencies. To achieve accurate text rendering, we identify two crucial requirements for text encoders: character awareness and alignment with glyphs. Our solution involves crafting a series of customized text encoder, Glyph-ByT5, by fine-tuning the character-aware ByT5 encoder using a meticulously curated paired glyph-text dataset. We present an effective method for integrating Glyph-ByT5 with SDXL, resulting in the creation of the Glyph-SDXL model for design image generation. This significantly enhances text rendering accuracy, improving it from less than $20\%$ to nearly $90\%$ on our design image benchmark. Noteworthy is Glyph-SDXL's newfound ability for text paragraph rendering, achieving high spelling accuracy for tens to hundreds of characters with automated multi-line layouts. Finally, through fine-tuning Glyph-SDXL with a small set of high-quality, photorealistic images featuring visual text, we showcase a substantial improvement in scene text rendering capabilities in open-domain real images. These compelling outcomes aim to encourage further exploration in designing customized text encoders for diverse and challenging tasks.
