Empowering Backbone Models for Visual Text Generation with Input Granularity Control and Glyph-Aware Training
Wenbo Li, Guohao Li, Zhibin Lan, Xue Xu, Wanru Zhuang, Jiachen Liu, Xinyan Xiao, Jinsong Su
TL;DR
This work identifies two bottlenecks that limit diffusion-based text-to-image backbones in visual text generation: BPE subword fragmentation and weak cross-attention binding for glyph tokens. It introduces a mixed granularity input strategy that treats glyph words as whole units, and a glyph-aware training framework with three losses ($\mathcal{L}_{attn}$, $\mathcal{L}_{loc}$, and $\mathcal{L}_{ocr}$) that improve the alignment between text tokens and visual glyphs and raise OCR accuracy. The approach yields semantically relevant, aesthetically appealing, and legible visual texts in English and Chinese while preserving the backbone's core image quality, as shown on English and Chinese datasets with quantitative metrics (CLIP, OCR precision/recall/F1, etc.) and user studies. In practice, this enables more reliable generation of visual texts across languages for AI art and content creation, with potential extensions to additional scripts and glyph-aware diffusion tasks.
Abstract
Diffusion-based text-to-image models have demonstrated impressive achievements in diversity and aesthetics but struggle to generate images with legible visual texts. Existing backbone models have limitations such as misspelling, failing to generate texts, and lacking support for Chinese text, but their development shows promising potential. In this paper, we propose a series of methods that aim to empower backbone models to generate visual texts in English and Chinese. We first conduct a preliminary study revealing that Byte Pair Encoding (BPE) tokenization and the insufficient learning of cross-attention modules restrict the performance of the backbone models. Based on these observations, we make the following improvements: (1) We design a mixed granularity input strategy to provide more suitable text representations; (2) We propose to augment the conventional training objective with three glyph-aware training losses, which enhance the learning of cross-attention modules and encourage the model to focus on visual texts. Through experiments, we demonstrate that our methods can effectively empower backbone models to generate semantically relevant, aesthetically appealing, and accurate visual text images, while maintaining their fundamental image generation quality.
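To make the idea of mixed granularity input concrete, the following is a minimal illustrative sketch, not the paper's implementation. It assumes that the text to be rendered is marked with double quotes in the prompt, and it uses a hypothetical `toy_subword_split` function as a stand-in for a real BPE tokenizer. Words inside quotes are kept as whole glyph tokens, while the rest of the prompt is split into subword pieces.

```python
import re


def toy_subword_split(word, max_len=3):
    """Hypothetical stand-in for BPE: chop a word into fixed-size pieces."""
    return [word[i:i + max_len] for i in range(0, len(word), max_len)]


def mixed_granularity_tokens(prompt):
    """Return (kind, token) pairs for a prompt.

    Assumption: glyph text is wrapped in double quotes. Quoted words are
    kept whole ("glyph"); all other words are split into subwords ("sub").
    """
    tokens = []
    # A capturing group makes re.split keep the quoted spans,
    # which then sit at the odd indices of the result.
    parts = re.split(r'"([^"]*)"', prompt)
    for i, part in enumerate(parts):
        if i % 2 == 1:
            # Glyph span: one token per whitespace-separated word.
            tokens.extend(("glyph", w) for w in part.split())
        else:
            # Ordinary prompt text: subword granularity.
            for w in part.split():
                tokens.extend(("sub", p) for p in toy_subword_split(w))
    return tokens


if __name__ == "__main__":
    print(mixed_granularity_tokens('a sign "Welcome"'))
```

In this toy setting the glyph word "Welcome" stays a single unit, whereas a subword tokenizer would fragment it; how whole-word glyph tokens are actually embedded and fed to the backbone is not specified here and is left out of the sketch.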
