Empowering Backbone Models for Visual Text Generation with Input Granularity Control and Glyph-Aware Training

Wenbo Li; Guohao Li; Zhibin Lan; Xue Xu; Wanru Zhuang; Jiachen Liu; Xinyan Xiao; Jinsong Su

TL;DR

This work identifies two bottlenecks in diffusion-based text-to-image backbones for visual text generation: BPE subword fragmentation and weak cross-attention binding for glyph tokens. It introduces a mixed granularity input strategy to treat glyph words as whole units and a glyph-aware training framework with three losses—$\mathcal{L}_{attn}$, $\mathcal{L}_{loc}$, and $\mathcal{L}_{ocr}$—to improve alignment between text tokens and visual glyphs and to improve OCR accuracy. The proposed approach yields semantically relevant, aesthetically appealing, and legible visual texts in English and Chinese while preserving core image quality, demonstrated through English and Chinese datasets, quantitative metrics (CLIP, OCR precision/recall/F1, etc.), and user studies. This has practical impact for AI art and content creation, enabling more reliable generation of visual texts across languages, with potential extensions to additional scripts and glyph-aware diffusion tasks.

Abstract

Diffusion-based text-to-image models have demonstrated impressive achievements in diversity and aesthetics but struggle to generate images with legible visual texts. Existing backbone models have limitations such as misspelling, failing to generate texts, and lack of support for Chinese text, but their development shows promising potential. In this paper, we propose a series of methods that aim to empower backbone models to generate visual texts in English and Chinese. We first conduct a preliminary study revealing that Byte Pair Encoding (BPE) tokenization and the insufficient learning of cross-attention modules restrict the performance of the backbone models. Based on these observations, we make the following improvements: (1) we design a mixed granularity input strategy to provide more suitable text representations; (2) we augment the conventional training objective with three glyph-aware training losses, which enhance the learning of cross-attention modules and encourage the model to focus on visual texts. Through experiments, we demonstrate that our methods can effectively empower backbone models to generate semantically relevant, aesthetically appealing, and accurate visual text images, while maintaining their fundamental image generation quality.
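
To make the mixed granularity idea concrete, the following is a minimal sketch, not the authors' implementation. It assumes that the glyph words to be rendered are marked with double quotes in the prompt, and it uses a hypothetical fixed-length splitter (`toy_subword_split`) as a stand-in for real BPE. The function names and the `("glyph", ...)` / `("sub", ...)` token tags are illustrative only.

```python
import re


def toy_subword_split(word, piece_len=3):
    """Stand-in for BPE (assumption): chop a word into fixed-size pieces."""
    return [word[i:i + piece_len] for i in range(0, len(word), piece_len)]


def mixed_granularity_tokenize(prompt):
    """Keep quoted glyph words whole; split every other word into subwords."""
    tokens = []
    # Match either a double-quoted span (glyph text) or a plain word.
    for match in re.finditer(r'"([^"]*)"|(\S+)', prompt):
        glyph_span, plain_word = match.group(1), match.group(2)
        if glyph_span is not None:
            # Each whitespace-separated glyph word becomes a single token.
            tokens.extend(("glyph", word) for word in glyph_span.split())
        else:
            # Ordinary prompt words are fragmented into subword pieces.
            tokens.extend(("sub", piece) for piece in toy_subword_split(plain_word))
    return tokens
```

Calling `mixed_granularity_tokenize('a sign that says "diffusion"')` yields subword pieces for the descriptive words and one whole-word token for "diffusion", whereas the stand-in splitter alone would fragment that word into several pieces. How the whole-word tokens are embedded is not covered by this sketch.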

Paper Structure (31 sections, 9 equations, 13 figures, 6 tables)

Introduction
Related Work
Visual Text Generation
Text-to-Image Backbone Models
Preliminary Study
Diffusion Based Text-to-Image Backbone Models
Experimental Analyses
Methods
Mixed Granularity Input
Glyph-Aware Training
Attention Alignment Loss $\mathcal{L}_{attn}$
Local MSE Loss $\mathcal{L}_{loc}$
OCR Recognition Loss $\mathcal{L}_{ocr}$
Experiments
Dataset
...and 16 more sections

Figures (13)

Figure 1: Comparison between the backbone models (top) and our models (bottom). Our methods can empower the backbone models to generate complex (top left) and artistic (top right) visual texts while maintaining fundamental image generation quality (bottom left). In addition, our methods can be transferred to Chinese text generation (bottom right).
Figure 2: Visual text generation results of our models. Our methods significantly empower the backbone models to generate semantically relevant, visually appealing visual text images in English and Chinese.
Figure 3: Visualization of the cross-attention maps. (a) "University" is correctly spelled, and its token has large attention values on the corresponding area. (b) "University" is misspelled, and the token "university</w>" fails to focus on the corresponding area. (c) The token "heart</w>" attends to the corresponding area and is therefore generated correctly, while the token "flower</w>" highlights an irrelevant region and fails to produce the corresponding visual text.
Figure 4: The framework of our methods. The Mixed Granularity Input strategy treats glyph words as whole units to provide more suitable text representations. Glyph-Aware Training includes three losses: (1) the attention alignment loss enhances the learning of cross-attention modules; (2) the local MSE loss highlights the importance of visual text areas; (3) the OCR recognition loss encourages the model to generate accurate visual texts.
Figure 5: Mixed granularity input. The word "diffusion" is considered as a whole instead of being tokenized.
...and 8 more figures
